Re: Welcome Karl Wright as a Lucene/Solr committer!

2016-04-04 Thread Han Jiang
Welcome Karl!

On Mon, Apr 4, 2016 at 3:40 PM, Karl Wright <daddy...@gmail.com> wrote:

> Hi all,
>
> Professionally, I've been active in software development since the
> 1970's.  My interests include many things related to software development,
> as well as areas as varied as geology, carpentry, and gardening.  I'm the
> PMC chair for the ManifoldCF project, as well as a committer on other
> Apache projects such as HttpComponents.
>
> My current employer is HERE, Inc., a spin-off from Nokia that sells map
> data, services, and search capabilities.
>
> I'm also the contributor and principal author of the Geo3D package, which
> is now part of Lucene under the spatial3d module.  I intend to continue to
> contribute to this package for the foreseeable future.
>
> Thanks!!
> Karl
>
>
> On Mon, Apr 4, 2016 at 10:28 AM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> I'm pleased to announce that Karl Wright has accepted the Lucene PMC's
>> invitation to become a committer.
>>
>> Karl, it's tradition that you introduce yourself with a brief bio.
>>
>> Karma has been granted to your pre-existing account, so that you can
>> add yourself to the committers section of the Who We Are page on the
>> website: http://lucene.apache.org/whoweare.html
>>
>> Congratulations and welcome!
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>
>


-- 
Han Jiang


Re: Welcome Dennis Gove as Lucene/Solr committer

2015-11-06 Thread Han Jiang
Welcome Dennis!

On Fri, Nov 6, 2015 at 3:19 PM, Joel Bernstein <joels...@gmail.com> wrote:

> I'm pleased to announce that Dennis Gove has accepted the PMC's
> invitation to become a committer.
>
> Dennis, it's tradition that you introduce yourself with a brief bio.
>
> Your account is not entirely ready yet. We will let you know when it is
> created
> and karma has been granted so that you can add yourself to the committers
> section of the Who We Are page on the website:
> <http://lucene.apache.org/whoweare.html>.
>
> Congratulations and welcome!
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>



-- 
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China


Re: Welcome Nick Knize as Lucene/Solr committer

2015-10-20 Thread Han Jiang
Welcome Nick!

On Wed, Oct 21, 2015 at 12:50 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> I'm pleased to announce that Nick Knize has accepted the PMC's
> invitation to become a committer.
>
> Nick, it's tradition that you introduce yourself with a brief bio /
> origin story, explaining how you arrived here.
>
> Your handle "nknize" has already been added to the "lucene" LDAP group, so
> you now have commit privileges.
>
> Please celebrate this rite of passage, and confirm that the right
> karma has in fact been enabled, by embarking on the challenge of adding
> yourself to the committers section of the Who We Are page on the
> website: http://lucene.apache.org/whoweare.html (use the ASF CMS
> bookmarklet
> at the bottom of the page here: https://cms.apache.org/#bookmark -
> more info here http://www.apache.org/dev/cms.html).
>
> Congratulations and welcome!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


-- 
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China


Re: Welcome Christine Poerschke as Lucene/Solr committer

2015-07-24 Thread Han Jiang
Welcome, Christine!

On Fri, Jul 24, 2015 at 3:27 PM, Adrien Grand <jpou...@gmail.com> wrote:

 I'm pleased to announce that Christine Poerschke has accepted the PMC's
 invitation to become a committer.

 Christine, it's tradition that you introduce yourself with a brief bio.

 Your account is not entirely ready yet. We will let you know when it is
 created
 and karma has been granted so that you can add yourself to the committers
 section of the Who We Are page on the website:
 http://lucene.apache.org/whoweare.html.

 Congratulations and welcome!

 --
 Adrien





-- 
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China


Re: Welcome Mikhail Khludnev as Lucene/Solr committer

2015-07-21 Thread Han Jiang
Welcome, Mikhail!

On Tue, Jul 21, 2015 at 3:21 PM, Adrien Grand <jpou...@gmail.com> wrote:

 I'm pleased to announce that Mikhail Khludnev has accepted the PMC's
 invitation to become a committer.

 Mikhail, it's tradition that you introduce yourself with a brief bio.

 Your handle "mkhl" has already been added to the "lucene" LDAP group, so
 you now have commit privileges. Please test this by adding yourself to
 the committers section of the Who We Are page on the website:
 http://lucene.apache.org/whoweare.html (use the ASF CMS bookmarklet
 at the bottom of the page here: https://cms.apache.org/#bookmark -
 more info here http://www.apache.org/dev/cms.html).

 Congratulations and welcome!

 --
 Adrien





-- 
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China


Re: Welcome Upayavira as Lucene/Solr committer

2015-06-22 Thread Han Jiang
Welcome, Upayavira!

On Tue, Jun 23, 2015 at 3:02 AM, Steve Rowe <sar...@gmail.com> wrote:

 I'm pleased to announce that Upayavira has accepted the PMC's invitation
 to become a committer.

 Upayavira, it's tradition that you introduce yourself with a brief bio.

 Mike McCandless, the Lucene PMC chair, has already added your "upayavira"
 account to the "lucene" LDAP group, so you now have commit privileges.
 Please test this by adding yourself to the committers section of the Who We
 Are page on the website: http://lucene.apache.org/whoweare.html (use
 the ASF CMS bookmarklet at the bottom of the page here: 
 https://cms.apache.org/#bookmark - more info here 
 http://www.apache.org/dev/cms.html).

 Congratulations and welcome!

 Steve




-- 
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China


Re: Welcome Gregory Chanan as Lucene/Solr committer

2014-09-19 Thread Han Jiang
Welcome Gregory!

On Sat, Sep 20, 2014 at 9:26 AM, Ryan Ernst <r...@iernst.net> wrote:
 Welcome Gregory!

 On Sep 19, 2014 3:33 PM, Steve Rowe <sar...@gmail.com> wrote:

 I'm pleased to announce that Gregory Chanan has accepted the PMC's
 invitation to become a committer.

 Gregory, it's tradition that you introduce yourself with a brief bio.

 Mark Miller, the Lucene PMC chair, has already added your "gchanan"
 account to the "lucene" LDAP group, so you now have commit privileges.
 Please test this by adding yourself to the committers section of the Who We
 Are page on the website: http://lucene.apache.org/whoweare.html (use the
 ASF CMS bookmarklet at the bottom of the page here:
 https://cms.apache.org/#bookmark - more info here
 http://www.apache.org/dev/cms.html).

 Since you're a committer on the Apache HBase project, you probably already
 know about it, but I'll include a link to the ASF dev page anyway - lots of
 useful links: http://www.apache.org/dev/.

 Congratulations and welcome!

 Steve







-- 
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China




[jira] [Commented] (LUCENE-5841) Remove FST.Builder.FreezeTail interface

2014-07-23 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071563#comment-14071563
 ] 

Han Jiang commented on LUCENE-5841:
---

It is really great to see this interface removed!

 Remove FST.Builder.FreezeTail interface
 ---

 Key: LUCENE-5841
 URL: https://issues.apache.org/jira/browse/LUCENE-5841
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/codecs
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 5.0, 4.10

 Attachments: LUCENE-5841.patch


 The FST Builder has a crazy-hairy interface called FreezeTail, which is only
 used by BlockTreeTermsWriter to find appropriate prefixes
 (i.e. containing enough terms or sub-blocks) to write term blocks.
 But this is really a silly abuse ... it's cleaner and likely
 faster/less GC for BTTW to compute this itself just by tracking the
 term ordinal where each prefix started in the pending terms/blocks.  The
 code is also insanely hairy, and this is at least a baby step to try
 to make it a bit simpler.
 This also makes it very hard to experiment with different formats at
 write-time because you have to get your new formats working through
 this strange FreezeTail.
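The bookkeeping described above — tracking the term ordinal at which each pending prefix run began, instead of routing block detection through a FreezeTail callback — can be sketched roughly as follows. This is an illustrative toy, not the actual BlockTreeTermsWriter; the class, method, and parameter names (`PrefixBlocks`, `findBlocks`, `minItems`) are invented, and it handles only a single prefix level:

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixBlocks {
    // Group sorted terms into blocks that share a prefix of prefixLen
    // chars. We track 'start', the ordinal where the current prefix run
    // began; when the prefix changes we either emit a shared-prefix block
    // (run long enough) or keep the terms as standalone entries.
    public static List<List<String>> findBlocks(List<String> sortedTerms,
                                                int prefixLen, int minItems) {
        List<List<String>> blocks = new ArrayList<>();
        int start = 0; // ordinal where the current prefix run began
        for (int ord = 1; ord <= sortedTerms.size(); ord++) {
            String runPrefix = prefix(sortedTerms.get(start), prefixLen);
            String next = ord < sortedTerms.size()
                ? prefix(sortedTerms.get(ord), prefixLen) : null;
            if (!runPrefix.equals(next)) {
                List<String> run = sortedTerms.subList(start, ord);
                if (run.size() >= minItems) {
                    blocks.add(new ArrayList<>(run)); // shared-prefix block
                } else {
                    for (String t : run) blocks.add(List.of(t)); // standalone
                }
                start = ord;
            }
        }
        return blocks;
    }

    private static String prefix(String s, int n) {
        return s.substring(0, Math.min(n, s.length()));
    }
}
```

The point of the sketch is that the writer itself decides where blocks start and end from the pending terms, so no freeze-tail hook into the FST builder is needed.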



--
This message was sent by Atlassian JIRA
(v6.2#6252)




[jira] [Closed] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2014-03-16 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang closed LUCENE-3069.
-


 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2014
 Fix For: 4.7

 Attachments: LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, df-ttf-estimate.txt, 
 example.png


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.
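The core idea in the description — keep the whole term dictionary in memory so a lookup resolves a term's metadata directly, with no scan of an on-disk delta-coded terms file — might look like this in miniature. A flat sorted array stands in for the FST with custom outputs; the class name and the packed-long metadata are invented for illustration:

```java
import java.util.Arrays;

public class MemoryTermDict {
    // Toy in-memory term dictionary: every full term maps directly to its
    // metadata (the role a custom fst.Output plays in the real patch), so
    // a lookup never touches a delta-coded terms file on disk.
    private final String[] terms; // sorted, unique terms
    private final long[] meta;    // e.g. a packed file pointer / docFreq

    public MemoryTermDict(String[] sortedTerms, long[] metadata) {
        this.terms = sortedTerms;
        this.meta = metadata;
    }

    // Binary search over the in-memory terms; a real FST would share
    // prefixes and suffixes and use far less memory than this flat array.
    public Long lookup(String term) {
        int i = Arrays.binarySearch(terms, term);
        return i >= 0 ? meta[i] : null;
    }
}
```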







[jira] [Resolved] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2014-03-16 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang resolved LUCENE-3069.
---

Resolution: Fixed

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.7

 Attachments: LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, df-ttf-estimate.txt, 
 example.png


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.







[jira] [Closed] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2014-03-16 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang closed LUCENE-3069.
-


 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.7

 Attachments: LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, df-ttf-estimate.txt, 
 example.png


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.







[jira] [Reopened] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2014-03-16 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang reopened LUCENE-3069:
---


 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.7

 Attachments: LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, df-ttf-estimate.txt, 
 example.png


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.







[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2014-03-16 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Labels: gsoc2013  (was: gsoc2014)

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.7

 Attachments: LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, df-ttf-estimate.txt, 
 example.png


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.







[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2014-03-16 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937095#comment-13937095
 ] 

Han Jiang commented on LUCENE-3069:
---

Had to reopen it because JIRA doesn't permit changing labels on a closed issue :)

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.7

 Attachments: LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, df-ttf-estimate.txt, 
 example.png


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.







Re: Welcome Anshum Gupta as Lucene/Solr Committer!

2014-02-16 Thread Han Jiang
Welcome Anshum!


On Mon, Feb 17, 2014 at 6:33 AM, Mark Miller <markrmil...@gmail.com> wrote:

 Hey everybody!

 The Lucene PMC is happy to welcome Anshum Gupta as a committer on the
 Lucene / Solr project.  Anshum has contributed to a number of issues for
 the project, especially around SolrCloud.

 Welcome Anshum!

 It's tradition to introduce yourself with a short bio :)

 --
 - Mark

 http://about.me/markrmiller




-- 
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China


[jira] [Resolved] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2014-01-30 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang resolved LUCENE-3069.
---

Resolution: Fixed

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2014
 Fix For: 4.7

 Attachments: LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, df-ttf-estimate.txt, 
 example.png


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)




[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2014-01-29 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886256#comment-13886256
 ] 

Han Jiang commented on LUCENE-3069:
---

Thanks Mike!

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2014
 Fix For: 4.7

 Attachments: LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, df-ttf-estimate.txt, 
 example.png


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.







Re: Welcome Benson Margulies as Lucene/Solr committer!

2014-01-25 Thread Han Jiang
Welcome Benson!


On Sun, Jan 26, 2014 at 6:10 AM, Benson Margulies <bimargul...@gmail.com> wrote:

 Hello Lucene development community, it's a pleasure to be welcomed aboard.

 In my view, the significant aspect of my bio is that I've been
 implementing things that go into or around Lucene for many years now.
 During the 'day', I'm the CTO of a company that works in the area of
 text analytics. We build Tokenizers and TokenFilters to allow our
 users to integrate our components into Lucene, and we've used Lucene
 and Solr as components of NLP devices that search on a large scale. So
 I have an abiding interest in the analysis chain and in the
 intersection of NLP and search.

 Elsewhere in Apache, I'm an active Maven dev, a semi-retired CXF dev,
 and a sort of uncle of several other projects. So I'm prone to be
 helpful or annoying with issues of Maven and Web Services.

 Thanks again, benson

 p.s. I think Uwe has already added me to the necessary wiring; would
 some kind soul please point me to the explanation of how the web site
 is maintained so I can add myself? Is it just the ASF CMS?






  On Sat, Jan 25, 2014 at 4:40 PM, Michael McCandless
  <luc...@mikemccandless.com> wrote:
  I'm pleased to announce that Benson Margulies has accepted to join our
  ranks as a committer.
 
  Benson has been involved in a number of Lucene/Solr issues over time
  (see
 http://jirasearch.mikemccandless.com/search.py?index=jirachg=ddsa1=allUsersa2=Benson+Margulies
  ), most recently on debugging tricky analysis issues.
 
  Benson, it is tradition that you introduce yourself with a brief bio.
  I know you're heavily involved in other Apache projects already...
 
  Once your account is set up, you should then be able to add yourself
  to the who we are page on the website as well.
 
  Congratulations and welcome!
 
  Mike McCandless
 
  http://blog.mikemccandless.com
 
 





-- 
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China


[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2014-01-22 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879423#comment-13879423
 ] 

Han Jiang commented on LUCENE-3069:
---

Thanks for catching this Mike! I wasn't quick to get that username :p

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.7

 Attachments: LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, df-ttf-estimate.txt, 
 example.png


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.







Re: Welcome Areek Zillur as Lucene/Solr committer!

2014-01-21 Thread Han Jiang
Congratulations and welcome Areek!


On Wed, Jan 22, 2014 at 12:57 PM, Shalin Shekhar Mangar
<shalinman...@gmail.com> wrote:

 Welcome Areek!

 On Wed, Jan 22, 2014 at 2:00 AM, Areek Zillur <areek...@gmail.com> wrote:
  Thanks Robert! I am very pleased to be a committer to Lucene/Solr!
 
  I am originally from Dhaka, Bangladesh. I am currently a 4th year Computer
  Engineering student at University of Waterloo in Canada. I was fortunate
  enough to have multiple internships all over North America through the
  university's co-op program. I was first introduced to Lucene/Solr in one of
  my work-terms at A9 and loved it.
 
  I really enjoy open-source development and the friendliness of the
  community behind Lucene/Solr. In my free time, I enjoy working on my
  recreational algorithmic trading system and learning new programming
  languages.
 
  I hope to continue to work on Lucene/Solr and learn a lot more from the
  community!
 
  Thanks,
 
  Areek Zillur
 
 
  On Tue, Jan 21, 2014 at 11:41 AM, Yonik Seeley <yo...@heliosearch.com>
  wrote:
 
  Welcome Areek!
 
  -Yonik
  http://heliosearch.com -- making solr shine
 
   On Tue, Jan 21, 2014 at 2:26 PM, Robert Muir <rcm...@gmail.com> wrote:
   I'm pleased to announce that Areek Zillur has accepted to join our
 ranks
   as
   a committer.
  
   Areek has been improving suggester support in Lucene and Solr,
 including
   a
   revamped Solr component slated for the 4.7 release. [1]
  
   Areek, it is tradition that you introduce yourself with a brief bio.
  
   Once your account is setup, you should then be able to add yourself to
   the
   who we are page on the website as well.
  
   Congratulations and welcome!
  
   [1] https://issues.apache.org/jira/browse/SOLR-5378
  
 
 
 



 --
 Regards,
 Shalin Shekhar Mangar.





-- 
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China


[jira] [Commented] (LUCENE-5376) Add a demo search server

2013-12-23 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855634#comment-13855634
 ] 

Han Jiang commented on LUCENE-5376:
---

+1, it will be great to have an 'active' demo to show the features :)

I think we should remove those hardcoded classpaths, e.g. in post.py:30?

And will this demo be expected to be the same as jirasearch? Will we need 
further configuration to get the demo website working? For example I cannot 
find search.py in the source code.


 Add a demo search server
 

 Key: LUCENE-5376
 URL: https://issues.apache.org/jira/browse/LUCENE-5376
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Attachments: lucene-demo-server.tgz


 I think it'd be useful to have a demo search server for Lucene.
 Rather than being fully featured, like Solr, it would be minimal, just 
 wrapping the existing Lucene modules to show how you can make use of these 
 features in a server setting.
 The purpose is to demonstrate how one can build a minimal search server on 
 top of APIs like SearchManager, SearcherLifetimeManager, etc.
 This is also useful for finding rough edges / issues in Lucene's APIs that 
 make building a server unnecessarily hard.
 I don't think it should have back compatibility promises (except Lucene's 
 index back compatibility), so it's free to improve as Lucene's APIs change.
 As a starting point, I'll post what I built for the "eating your own dog 
 food" search app for Lucene's & Solr's jira issues 
 http://jirasearch.mikemccandless.com (blog: 
 http://blog.mikemccandless.com/2013/05/eating-dog-food-with-lucene.html ). It 
 uses Netty to expose basic indexing & searching APIs via JSON, but it's very 
 rough (lots of nocommits).
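A minimal flavor of such a server — an in-memory index behind a couple of static methods, with optional HTTP wiring — might be sketched like this. The JDK's built-in com.sun.net.httpserver stands in for Netty; the endpoint shape and all names are invented here, not taken from the attached demo:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.util.*;

public class DemoSearchServer {
    // Tiny inverted index standing in for Lucene's indexing machinery.
    static final Map<String, Set<String>> INDEX = new HashMap<>();
    static final Set<String> DOCS = new TreeSet<>();

    static void indexDoc(String docId, String text) {
        DOCS.add(docId);
        for (String tok : text.toLowerCase().split("\\s+")) {
            INDEX.computeIfAbsent(tok, k -> new TreeSet<>()).add(docId);
        }
    }

    // AND query over whitespace tokens; an empty query matches all docs.
    static List<String> search(String query) {
        Set<String> hits = new TreeSet<>(DOCS);
        for (String tok : query.toLowerCase().split("\\s+")) {
            if (tok.isEmpty()) continue;
            hits.retainAll(INDEX.getOrDefault(tok, Set.of()));
        }
        return new ArrayList<>(hits);
    }

    // Optional HTTP wiring: GET /search?q=... returns the matching doc ids.
    static HttpServer serve(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/search", exchange -> {
            String raw = exchange.getRequestURI().getQuery(); // e.g. "q=lucene"
            String query = raw == null ? "" : raw.replaceFirst("^q=", "");
            byte[] body = search(query).toString().getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
        });
        server.start();
        return server;
    }
}
```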







Re: svn commit: r1548830 - /lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/store/SimpleFSLockFactory.java

2013-12-07 Thread Han Jiang
OK! thanks for the fix!


On Sat, Dec 7, 2013 at 7:21 PM, Uwe Schindler <u...@thetaphi.de> wrote:

 Hi Han,

  I committed an even better fix for this, using the native javadoc linking
  instead of an HTML link. That way it's automatically also correct in
  branch_4x (where the docs have to link to the Java 6 SE javadocs).

 Uwe

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


  -Original Message-
  From: h...@apache.org [mailto:h...@apache.org]
  Sent: Saturday, December 07, 2013 11:34 AM
  To: comm...@lucene.apache.org
  Subject: svn commit: r1548830 -
  /lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/store/SimpleF
  SLockFactory.java
 
  Author: han
  Date: Sat Dec  7 10:34:21 2013
  New Revision: 1548830
 
  URL: http://svn.apache.org/r1548830
  Log:
  broken link police
 
  Modified:
 
  lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/store/SimpleFS
  LockFactory.java
 
  Modified:
  lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/store/SimpleFS
  LockFactory.java
  URL:
  http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/
  apache/lucene/store/SimpleFSLockFactory.java?rev=1548830&r1=1548829&r2=1548830&view=diff
  ==
  
  ---
  lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/store/SimpleFS
  LockFactory.java (original)
  +++
  lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/store/SimpleFS
  LockFactory.java Sat Dec  7 10:34:21 2013
  @@ -25,7 +25,7 @@ import java.io.IOException;
    * File#createNewFile()}.</p>
    *
    * <p><b>NOTE:</b> the <a target="_top"
  - * href="http://java.sun.com/j2se/1.4.2/docs/api/java/io/File.html#createNewFile()">javadocs
  + * href="http://docs.oracle.com/javase/7/docs/api/java/io/File.html#createNewFile()">javadocs
    * for <code>File.createNewFile</code></a> contain a vague
    * yet spooky warning about not using the API for file
    * locking.  This warning was added due to a <a target="_top"
* yet spooky warning about not using the API for file
* locking.  This warning was added due to a target=_top



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




-- 
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China


Re: [VOTE] Lucene / Solr 4.6.0

2013-11-14 Thread Han Jiang
Wow, congratulations Uwe!


On Fri, Nov 15, 2013 at 4:11 AM, Uwe Schindler u...@thetaphi.de wrote:

 The PMC Chair is going to marry tomorrow... Simon has to come here and not
 do new RCs! :)

 In any case, thanks for doing the release, Simon. I will do the next!

 Uwe



 Simon Willnauer simon.willna...@gmail.com schrieb:

 Thanks Steve I won't get to this until next week. I will upload a new RC on 
 monday.

 Simon

 Sent from my iPhone

  On 14 Nov 2013, at 20:20, Steve Rowe sar...@gmail.com wrote:

  I’ve committed fixes, to lucene_solr_4_6 as well as to branch_4x and 
 trunk, for all the problems I mentioned.


  The first revision including all these is 1542030.

  Steve


  On Nov 14, 2013, at 1:16 PM, Steve Rowe sar...@gmail.com wrote:

  -1

  Smoke tester passes.

  Solr Changes look good, except that the “Upgrading from Solr 4.5.0” 
 section follows “Detailed Change List”, but should be above it; and one 
 change attribution
 didn’t get recognized because it’s missing parens: Elran Dvir via Erick 
 Erickson.  Definitely not worth a respin in either case.

  Lucene Changes look good, except that the “API Changes” section in 
 Changes.html is formatted as an item in the “Bug Fixes” section, rather 
 than its own section.  I’ll fix.  (The issue is that “API Changes:” in 
 CHANGES.txt has a trailing colon - the section name regex should allow 
 this. )  This is probably not worth a respin.


  Lucene and Solr Documentation pages look good, except that the “File 
 Formats” link from the Lucene Documentation page leads to the 4.5 format 
 doc, rather than the 4.6 format doc (Lucene46Codec was introduced by 
 LUCENE-5215).  This is respin-worthy.  Updating this is not automated now 
 - it’s hard-coded in lucene/site/xsl/index.xsl - the default codec doesn’t 
 change in every release.  I’ll try to automate extracting the default from 
 o.a.l.codecs.Codec#defaultCodec [ =
 Codec.forName(“Lucene46”)].

  Lucene and Solr Javadocs look good.

  Steve

  On Nov 14, 2013, at 4:37 AM, Simon Willnauer simon.willna...@gmail.com 
 wrote:

  Please vote for the first Release Candidate for Lucene/Solr 4.6.0


  you can download it here:
  
 http://people.apache.org/~simonw/staging_area/lucene-solr-4.6.0-RC1-rev1541686


  or run the smoke tester directly with this commandline (don't forget
  to set JAVA6_HOME etc.):

  python3.2 -u dev-tools/scripts/smokeTestRelease.py

  
 http://people.apache.org/~simonw/staging_area/lucene-solr-4.6.0-RC1-rev1541686
  1541686 4.6.0 /tmp/smoke_test_4_6



  I integrated the RC into Elasticsearch and all tests pass:

  
 https://github.com/s1monw/elasticsearch/commit/765e3194bb23f202725bfb28d9a2fd7cc71b49de


  Smoketester said: SUCCESS! [1:15:57.339272]

  here is my +1


  Simon

 --

  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org

  For additional commands, e-mail: dev-h...@lucene.apache.org



 --

  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org

  For additional commands, e-mail: dev-h...@lucene.apache.org


 --

 To unsubscribe,
 e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org


 --
 Uwe Schindler
 H.-H.-Meier-Allee 63, 28213 Bremen
 http://www.thetaphi.de




-- 
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China


Re: Welcome Ryan Ernst as Lucene/Solr committer

2013-10-14 Thread Han Jiang
Welcome, Ryan!


On Tue, Oct 15, 2013 at 2:13 AM, Ryan Ernst r...@iernst.net wrote:

 Thanks Adrian.

 I grew up in Bakersfield, CA (colloquially known as the armpit of
 California).  I escaped and went to Cal Poly for my bachelors in computer
 science, and after a very brief stint working on HPUX, I landed working on
 the Amazon search engine for A9. I especially enjoy working with
 compression and encodings, and hope to experiment there some more with
 Lucene.

 Thanks
 Ryan


 On Mon, Oct 14, 2013 at 10:27 AM, Adrien Grand jpou...@gmail.com wrote:

 I'm pleased to announce that Ryan Ernst has accepted to join our ranks
 as a committer.

 Ryan has been working on a number of Lucene and Solr issues and
 recently contributed the new expressions module[1] which allows for
 compiling javascript expressions into SortField instances with
 excellent performance since it doesn't rely on a scripting engine but
 directly generates Java bytecode. This is a very exciting change which
 will be available in Lucene 4.6.

 Ryan, it is tradition that you introduce yourself with a brief bio.

 Congratulations and welcome!

 [1] https://issues.apache.org/jira/browse/LUCENE-5207

 --
 Adrien

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org





-- 
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China


[jira] [Commented] (LUCENE-5268) Cutover more postings formats to the inverted pull API

2013-10-10 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791278#comment-13791278
 ] 

Han Jiang commented on LUCENE-5268:
---

+1, the pulsing code is much cleaner!

 Cutover more postings formats to the inverted pull API
 

 Key: LUCENE-5268
 URL: https://issues.apache.org/jira/browse/LUCENE-5268
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 5.0

 Attachments: LUCENE-5268.patch


 In LUCENE-5123, we added a new, more flexible, pull API for writing
 postings.  This API allows the postings format to iterate the
 fields/terms/postings more than once, and mirrors the API for writing
 doc values.
 But that was just the first step (only SimpleText was cutover to the
 new API).  I want to cutover more components, so we can (finally)
 e.g. play with different encodings depending on the term's postings,
 such as using a bitset for high freq DOCS_ONLY terms (LUCENE-5052).






Re: svn commit: r1530537 - in /lucene/dev/trunk/lucene: common-build.xml ivy-settings.xml

2013-10-09 Thread Han Jiang
oh, yes, I'll do that!


On Wed, Oct 9, 2013 at 5:17 PM, Robert Muir rcm...@gmail.com wrote:

 Thanks for updating this!

 I think we should merge this back to branch 4.x too? This way the
 source code tar.gz is working from China for our next release?

 2013/10/9  h...@apache.org:
  Author: han
  Date: Wed Oct  9 08:56:15 2013
  New Revision: 1530537
 
  URL: http://svn.apache.org/r1530537
  Log:
  update broken links for maven mirror
 
  Modified:
  lucene/dev/trunk/lucene/common-build.xml
  lucene/dev/trunk/lucene/ivy-settings.xml
 
  Modified: lucene/dev/trunk/lucene/common-build.xml
  URL:
 http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/common-build.xml?rev=1530537&r1=1530536&r2=1530537&view=diff
 
 ==
  --- lucene/dev/trunk/lucene/common-build.xml (original)
  +++ lucene/dev/trunk/lucene/common-build.xml Wed Oct  9 08:56:15 2013
  @@ -360,7 +360,7 @@
     <property name="ivy_install_path" location="${user.home}/.ant/lib" />
     <property name="ivy_bootstrap_url1" value="http://repo1.maven.org/maven2"/>
     <!-- you might need to tweak this from china so it works -->
  -  <property name="ivy_bootstrap_url2" value="http://mirror.netcologne.de/maven2"/>
  +  <property name="ivy_bootstrap_url2" value="http://uk.maven.org/maven2"/>
     <property name="ivy_checksum_sha1" value="c5ebf1c253ad4959a29f4acfe696ee48cdd9f473"/>

     <target name="ivy-availability-check" unless="ivy.available">
 
  Modified: lucene/dev/trunk/lucene/ivy-settings.xml
  URL:
 http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/ivy-settings.xml?rev=1530537&r1=1530536&r2=1530537&view=diff
 
 ==
  --- lucene/dev/trunk/lucene/ivy-settings.xml (original)
  +++ lucene/dev/trunk/lucene/ivy-settings.xml Wed Oct  9 08:56:15 2013
  @@ -35,7 +35,7 @@
     <ibiblio name="maven.restlet.org" root="http://maven.restlet.org" m2compatible="true" />

     <!-- you might need to tweak this from china so it works -->
  -  <ibiblio name="working-chinese-mirror" root="http://mirror.netcologne.de/maven2" m2compatible="true" />
  +  <ibiblio name="working-chinese-mirror" root="http://uk.maven.org/maven2" m2compatible="true" />

     <!-- temporary to try Clover 3.2.0 snapshots, see https://issues.apache.org/jira/browse/LUCENE-5243,
          https://jira.atlassian.com/browse/CLOV-1368 -->
     <ibiblio name="atlassian-clover-snapshots" root="https://maven.atlassian.com/content/repositories/atlassian-public-snapshot" m2compatible="true" />
 
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




Re: Welcome Joel Bernstein

2013-10-03 Thread Han Jiang
Welcome Joel!


On Thu, Oct 3, 2013 at 1:24 PM, Grant Ingersoll gsing...@apache.org wrote:

 Hi,

 The Lucene PMC is happy to welcome Joel Bernstein as a committer on the
 Lucene and Solr project.  Joel has been working on a number of issues on
 the project and we look forward to his continued contributions going
 forward.

 Welcome aboard, Joel!

 -Grant
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




-- 
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China


Re: Core trunk compile-test fails on rev 1527154, TestNumericDocValuesUpdates, Lucene45RWCodec

2013-09-29 Thread Han Jiang
Hi Paul,

Just an FYI, I cannot reproduce it on my machine. Maybe... you need 'ant
clean'?


On Sun, Sep 29, 2013 at 9:02 PM, Paul Elschot paul.j.elsc...@gmail.comwrote:

 Dear readers,

 When I update my working copy of lucene core trunk to current latest rev
 1527154, ant compile-test fails with this message:

 ... lucene/trunk/lucene/core/src/**test/org/apache/lucene/index/**
 TestNumericDocValuesUpdates.**java:17: error: cannot find symbol
 [javac] import org.apache.lucene.codecs.**lucene45.Lucene45RWCodec;

 After updating (backdating) to the 26th: svn update -r {20130926}
 ant compile-test works normally.

 I couldn't decide on which issue to post this, so here it is.

 Regards,
 Paul Elschot




 --**--**-
 To unsubscribe, e-mail: 
 dev-unsubscribe@lucene.apache.**orgdev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




-- 
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China


Re: Core trunk compile-test fails on rev 1527154, TestNumericDocValuesUpdates, Lucene45RWCodec

2013-09-29 Thread Han Jiang
Paul, a quick hack (e.g. for your development) is to run the ant command
in /lucene instead of /lucene/core. I don't know why the latter fails, and hope
someone can explain it, though :)


On Sun, Sep 29, 2013 at 9:54 PM, Paul Elschot paul.j.elsc...@gmail.comwrote:

  Hi Han,

 I just reproduced it three times on my working copy in the directory
 trunk/lucene/core
 with ant clean in the command sequence:

 svn update -r {20130926}
 ant clean
 ant compile-test # build successful

 svn update -r 1527154
 ant clean
 ant compile-test # build failed


 In my working copy  svn status currently produces this:

 M   src/java/org/apache/lucene/util/packed/EliasFanoDecoder.java
 M   src/java/org/apache/lucene/util/packed/EliasFanoDocIdSet.java
 M   src/java/org/apache/lucene/util/packed/EliasFanoEncoder.java
 M   src/test/org/apache/lucene/util/packed/TestEliasFanoSequence.java

 and I don't expect these have an influence.



 To be complete, on revision 1526316 (of the 26th) I also got this output
 (slightly edited) once:

 compile-core:
 [mkdir] Created dir:  lucene/trunk/lucene/build/core/classes/java
 [javac] Compiling 672 source files to ...
 lucene/trunk/lucene/build/core/classes/java
 [javac] An exception has occurred in the compiler (1.7.0_21). Please
 file a bug at the Java Developer Connection (
 http://java.sun.com/webapps/bugreport)  after checking the Bug Parade for
 duplicates. Include your program and the following diagnostic in your
 report.  Thank you.
 [javac] java.lang.AbstractMethodError
 

 I could not reproduce that one with four more attempts, so I do hope that
 that was a one time glitch.

 But it is strange that on my machine:
 javac -version

 produces:
 javac 1.7.0_40

 and the compiler exception message above reports 1.7.0_21.
 Perhaps there is something wrong with my java/javac setup, any advice
 there?

 Regards,
 Paul Elschot


 P.S.
 java -version

 produces:
 java version 1.7.0_40
 Java(TM) SE Runtime Environment (build 1.7.0_40-b43)
 Java HotSpot(TM) Server VM (build 24.0-b56, mixed mode)



 On 29-09-13 15:11, Han Jiang wrote:

 Hi Paul,

 Just an FYI, it cannot reproduce on my machine. Maybe... you need 'ant
 clean' ?


 On Sun, Sep 29, 2013 at 9:02 PM, Paul Elschot paul.j.elsc...@gmail.comwrote:

 Dear readers,

 When I update my working copy of lucene core trunk to current latest rev
 1527154, ant compile-test fails with this message:

 ...
 lucene/trunk/lucene/core/src/test/org/apache/lucene/index/TestNumericDocValuesUpdates.java:17:
 error: cannot find symbol
 [javac] import org.apache.lucene.codecs.lucene45.Lucene45RWCodec;

 After updating (backdating) to the 26th: svn update -r {20130926}
 ant compile-test works normally.

 I couldn't decide on which issue to post this, so here it is.

 Regards,
 Paul Elschot




 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




 --
 Han Jiang

 Team of Search Engine and Web Mining,
 School of Electronic Engineering and Computer Science,
 Peking University, China





-- 
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China


Re: Core trunk compile-test fails on rev 1527154, TestNumericDocValuesUpdates, Lucene45RWCodec

2013-09-29 Thread Han Jiang
Shai, I think we should change TestRuleSetupAndRestoreClassEnv; I'll upload
a patch for this.


On Sun, Sep 29, 2013 at 10:18 PM, Shai Erera ser...@gmail.com wrote:

 I'm not sure but maybe it is related to the fact you run it from
 lucene/core. Since on LUCENE-5215 (rev 1527154) I created a new
 Lucene46Codec, and moved Lucene45 stuff under test-framework, as well as
 changed Lucene45Codec.fieldInfosFormat to not be final, perhaps you need to
 run 'ant clean' from the root to make sure all changes are compiled
 accordingly?

 Out of curiosity, did you 'svn up' from root, or perhaps from lucene/core
 by accident?

 Shai


 On Sun, Sep 29, 2013 at 4:54 PM, Paul Elschot paul.j.elsc...@gmail.comwrote:

  Hi Han,

 I just reproduced it three times on my working copy in the directory
 trunk/lucene/core
 with ant clean in the command sequence:

 svn update -r {20130926}
 ant clean
 ant compile-test # build successful

 svn update -r 1527154
 ant clean
 ant compile-test # build failed


 In my working copy  svn status currently produces this:

 M   src/java/org/apache/lucene/util/packed/EliasFanoDecoder.java
 M   src/java/org/apache/lucene/util/packed/EliasFanoDocIdSet.java
 M   src/java/org/apache/lucene/util/packed/EliasFanoEncoder.java
 M   src/test/org/apache/lucene/util/packed/TestEliasFanoSequence.java

 and I don't expect these have an influence.



 To be complete, on revision 1526316 (of the 26th) I also got this output
 (slightly edited) once:

 compile-core:
 [mkdir] Created dir:  lucene/trunk/lucene/build/core/classes/java
 [javac] Compiling 672 source files to ...
 lucene/trunk/lucene/build/core/classes/java
 [javac] An exception has occurred in the compiler (1.7.0_21). Please
 file a bug at the Java Developer Connection (
 http://java.sun.com/webapps/bugreport)  after checking the Bug Parade
 for duplicates. Include your program and the following diagnostic in your
 report.  Thank you.
 [javac] java.lang.AbstractMethodError
 

 I could not reproduce that one with four more attempts, so I do hope that
 that was a one time glitch.

 But it is strange that on my machine:
 javac -version

 produces:
 javac 1.7.0_40

 and the compiler exception message above reports 1.7.0_21.
 Perhaps there is something wrong with my java/javac setup, any advice
 there?

 Regards,
 Paul Elschot


 P.S.
 java -version

 produces:
 java version 1.7.0_40
 Java(TM) SE Runtime Environment (build 1.7.0_40-b43)
 Java HotSpot(TM) Server VM (build 24.0-b56, mixed mode)


 On 29-09-13 15:11, Han Jiang wrote:

 Hi Paul,

 Just an FYI, it cannot reproduce on my machine. Maybe... you need 'ant
 clean' ?


 On Sun, Sep 29, 2013 at 9:02 PM, Paul Elschot 
 paul.j.elsc...@gmail.comwrote:

 Dear readers,

 When I update my working copy of lucene core trunk to current latest rev
 1527154, ant compile-test fails with this message:

 ...
 lucene/trunk/lucene/core/src/test/org/apache/lucene/index/TestNumericDocValuesUpdates.java:17:
 error: cannot find symbol
 [javac] import org.apache.lucene.codecs.lucene45.Lucene45RWCodec;

 After updating (backdating) to the 26th: svn update -r {20130926}
 ant compile-test works normally.

 I couldn't decide on which issue to post this, so here it is.

 Regards,
 Paul Elschot




 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




 --
 Han Jiang

 Team of Search Engine and Web Mining,
 School of Electronic Engineering and Computer Science,
 Peking University, China






-- 
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China


[jira] [Updated] (LUCENE-5215) Add support for FieldInfos generation

2013-09-29 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-5215:
--

Attachment: LUCENE-5215.patch

Patch for the compile error mentioned by Paul.

 Add support for FieldInfos generation
 -

 Key: LUCENE-5215
 URL: https://issues.apache.org/jira/browse/LUCENE-5215
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/index
Reporter: Shai Erera
Assignee: Shai Erera
 Attachments: LUCENE-5215.patch, LUCENE-5215.patch, LUCENE-5215.patch, 
 LUCENE-5215.patch, LUCENE-5215.patch, LUCENE-5215.patch, LUCENE-5215.patch, 
 LUCENE-5215.patch


 In LUCENE-5189 we've identified a few reasons to do that: 
 # If you want to update docs' values of field 'foo', where 'foo' exists in 
 the index, but not in a specific segment (sparse DV), we cannot allow that 
 and have to throw a late UOE. If we could rewrite FieldInfos (with 
 generation), this would be possible since we'd also write a new generation of 
 FIS.
 # When we apply NDV updates, we call DVF.fieldsConsumer. Currently the 
 consumer isn't allowed to change FI.attributes because we cannot modify the 
 existing FIS. This is implicit however, and we silently ignore any modified 
 attributes. FieldInfos.gen will allow that too.
 The idea is to add to SIPC fieldInfosGen, add to each FieldInfo a dvGen and 
 add support for FIS generation in FieldInfosFormat, SegReader etc., like we 
 now do for DocValues. I'll work on a patch.
 Also on LUCENE-5189, Rob raised a concern about SegmentInfo.attributes that 
 have same limitation -- if a Codec modifies them, they are silently being 
 ignored, since we don't gen the .si files. I think we can easily solve that 
 by recording SI.attributes in SegmentInfos, so they are recorded per-commit. 
 But I think it should be handled in a separate issue.






[jira] [Updated] (LUCENE-5215) Add support for FieldInfos generation

2013-09-29 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-5215:
--

Attachment: LUCENE-5215.patch

 Add support for FieldInfos generation
 -

 Key: LUCENE-5215
 URL: https://issues.apache.org/jira/browse/LUCENE-5215
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/index
Reporter: Shai Erera
Assignee: Shai Erera
 Attachments: LUCENE-5215.patch, LUCENE-5215.patch, LUCENE-5215.patch, 
 LUCENE-5215.patch, LUCENE-5215.patch, LUCENE-5215.patch, LUCENE-5215.patch, 
 LUCENE-5215.patch








[jira] [Updated] (LUCENE-5215) Add support for FieldInfos generation

2013-09-29 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-5215:
--

Attachment: (was: LUCENE-5215.patch)

 Add support for FieldInfos generation
 -

 Key: LUCENE-5215
 URL: https://issues.apache.org/jira/browse/LUCENE-5215
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/index
Reporter: Shai Erera
Assignee: Shai Erera
 Attachments: LUCENE-5215.patch, LUCENE-5215.patch, LUCENE-5215.patch, 
 LUCENE-5215.patch, LUCENE-5215.patch, LUCENE-5215.patch, LUCENE-5215.patch, 
 LUCENE-5215.patch








[jira] [Commented] (LUCENE-5215) Add support for FieldInfos generation

2013-09-29 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13781405#comment-13781405
 ] 

Han Jiang commented on LUCENE-5215:
---

I guess so :)

 Add support for FieldInfos generation
 -

 Key: LUCENE-5215
 URL: https://issues.apache.org/jira/browse/LUCENE-5215
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/index
Reporter: Shai Erera
Assignee: Shai Erera
 Attachments: LUCENE-5215.patch, LUCENE-5215.patch, LUCENE-5215.patch, 
 LUCENE-5215.patch, LUCENE-5215.patch, LUCENE-5215.patch, LUCENE-5215.patch, 
 LUCENE-5215.patch








Re: Welcome back, Wolfgang Hoschek!

2013-09-27 Thread Han Jiang
Welcome back Wolfgang!


On Fri, Sep 27, 2013 at 2:19 PM, Robert Muir rcm...@gmail.com wrote:

 Welcome back!

 On Thu, Sep 26, 2013 at 6:21 AM, Uwe Schindler uschind...@apache.org
 wrote:
  Hi,
 
  I'm pleased to announce that after a long abstinence, Wolfgang Hoschek
 rejoined the Lucene/Solr committer team. He is working now at Cloudera and
 plans to help with the integration of Solr and Hadoop.
  Wolfgang originally wrote the MemoryIndex, which is used by the
 classical Lucene highlighter and ElasticSearch's percolator module.
 
  Looking forward to new contributions.
 
  Welcome back  heavy committing! :-)
  Uwe
 
  P.S.: Wolfgang, as soon as you have setup your subversion access, you
 should add yourself back to the committers list on the website as well.
 
  -
  Uwe Schindler
  uschind...@apache.org
  Apache Lucene PMC Chair / Committer
  Bremen, Germany
  http://lucene.apache.org/
 
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




-- 
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China


[jira] [Commented] (LUCENE-5123) invert the codec postings API

2013-09-20 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772862#comment-13772862
 ] 

Han Jiang commented on LUCENE-5123:
---

Nice change! Although PushFieldsConsumer is still using the old API, I like the
migration of the flush() logic from FreqProxTermsWriterPerField to
PushFieldsConsumer; the calling chain is clearer at the codec level now. :)

Also, I'm quite curious whether StoredFields and TermVectors will get rid of 
'merge()' later.


 invert the codec postings API
 -

 Key: LUCENE-5123
 URL: https://issues.apache.org/jira/browse/LUCENE-5123
 Project: Lucene - Core
  Issue Type: Wish
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 5.0

 Attachments: LUCENE-5123.patch, LUCENE-5123.patch, LUCENE-5123.patch, 
 LUCENE-5123.patch, LUCENE-5123.patch


 Currently FieldsConsumer/PostingsConsumer/etc is a push oriented api, e.g. 
 FreqProxTermsWriter streams the postings at flush, and the default merge() 
 takes the incoming codec api and filters out deleted docs and pushes via 
 same api (but that can be overridden).
 It could be cleaner if we allowed for a pull model instead (like 
 DocValues). For example, maybe FreqProxTermsWriter could expose a Terms of 
 itself and just passed this to the codec consumer.
 This would give the codec more flexibility to e.g. do multiple passes if it 
 wanted to do things like encode high-frequency terms more efficiently with a 
 bitset-like encoding or other things...
 A codec can try to do things like this to some extent today, but its very 
 difficult (look at buffering in Pulsing). We made this change with DV and it 
 made a lot of interesting optimizations easy to implement...
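
 The push/pull contrast above can be illustrated with toy interfaces
 (hypothetical stand-ins, not Lucene's actual FieldsConsumer/Terms API): a push
 consumer sees each posting exactly once as the writer streams it, while a pull
 consumer is handed a source it may iterate several times, e.g. one pass to
 gather statistics and a second to choose an encoding.

```java
import java.util.List;

// Toy contrast between a "push" postings API (writer drives, consumer sees
// each posting once) and a "pull" API (consumer drives and may make several
// passes, e.g. to pick an encoding based on frequency).
public class PushVsPull {
    interface PushConsumer { void accept(int doc); }
    interface PullSource { Iterable<Integer> postings(); }

    // Push: the caller streams the postings into the consumer.
    static void flushPush(List<Integer> docs, PushConsumer consumer) {
        for (int d : docs) consumer.accept(d);
    }

    // Pull: the consumer makes two passes -- first to count, then to encode.
    static String encodePull(PullSource source) {
        int count = 0;
        for (int ignored : source.postings()) count++;              // pass 1: stats
        StringBuilder sb = new StringBuilder(count + ":");
        for (int d : source.postings()) sb.append(d).append(' ');   // pass 2: encode
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        List<Integer> docs = List.of(1, 5, 9);
        StringBuilder pushed = new StringBuilder();
        flushPush(docs, d -> pushed.append(d).append(' '));
        System.out.println(pushed.toString().trim());
        System.out.println(encodePull(() -> docs));
    }
}
```

 Under the pull model the second pass can pick a different encoding once the
 first pass has seen the term's statistics, which is what makes optimizations
 like the bitset idea mentioned for high-frequency terms easy to express.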

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Can we use TREC data set in open source?

2013-09-20 Thread Han Jiang
 I read here http://lemurproject.org/clueweb09/ that there is a hosted
 version of ClueWeb09 (the latest is ClueWeb12, for which I don't find a
 hosted version), and to get access to it, someone from the ASF will need
 to sign an Organizational Agreement with them as well as each individual
 in the project will need to sign an Individual Agreement (retained by the
 ASF). Perhaps this can be available only to committers.

This is nice! I'll try to ask ASF about this.

 To this day, I think the only way it will happen is for the community
 to build a completely open system, perhaps based off of Common Crawl or
 our own crawl and host it ourselves and develop judgments, etc.

Yeah, this is what we need in ORP.

 Most people like the idea, but are not sure how to distribute it in an
 open way (ClueWeb comes as 4 1TB disks right now) and I am also not sure
 how they would handle any copyright/redaction claims against it.  There
 is, of course, little incentive for those involved to solve these, either,
 as most people who are interested sign the form and pay the $600 for the
 disks.

Sigh, yes, it is hard to make a data set totally public. Actually, one of
my purposes in asking this question is to see whether it is acceptable in our
community (i.e. Lucene/Solr only) to obtain a data set that is not open to all
people. When expanded to a larger scope, the license issue is somewhat
hairy...


And since Shai has found a possible 'free' data set, I think it is possible
for ASF to obtain an Organizational Agreement for this. I'll try to contact
ASF & CMU about how they define 'person with the authority' in OSS.


On Tue, Sep 17, 2013 at 6:11 AM, Grant Ingersoll gsing...@apache.org wrote:

 Inline below

 On Sep 9, 2013, at 10:53 PM, Han Jiang jiangha...@gmail.com wrote:

 Back in 2007 Grant contacted NIST about making the TREC collection
 available to our community:

 http://mail-archives.apache.org/mod_mbox/lucene-dev/200708.mbox/browser

 I think a try for this is really important to our project and people who
 use Lucene. All these years the speed performance is mainly tuned on
 Wikipedia, however it's not very 'standard':

 * it doesn't represent how real-world search works;
 * it cannot be used to evaluate the relevance of our scoring models;
 * researchers tend to do experiments on other data sets, and usually it is
   hard to know whether Lucene delivers its best performance;

 And personally I agree with this line:

  I think it would encourage Lucene users/developers to think about
  relevance as much as we think about speed.

 There's been much work to make Lucene's scoring models pluggable in 4.0,
 and it'll be great if we can explore more about it. It is very appealing to
 see a high-performance library work along with state-of-the-art ranking
 methods.


 And about TREC data set, the problems we met are:

 1. NIST/TREC does not own the original collections, therefore it might be
necessary to have direct contact with those organizations who really
 did,
such as:

http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html
http://lemurproject.org/clueweb12/

 2. Currently, there is no open-source license for any of the data sets, so
it won't be as 'open' as Wikipedia is.

 As is proposed by Grant, a possibility is to make the data set accessible
 only to committers instead of all users. It is not very open-source then,
 but TREC data sets are public and usually available to researchers, so
 people can still reproduce performance tests.

 I'm quite curious, has anyone explored getting an open-source license for
 one of those data sets? And is our community still interested about this
 issue after all these years?


 It continues to be of interest to me.  I've had various conversations
 throughout the years on it.  Most people like the idea, but are not sure
 how to distribute it in an open way (ClueWeb comes as 4 1TB disks right
 now) and I am also not sure how they would handle any copyright/redaction
 claims against it.  There is, of course, little incentive for those
 involved to solve these, either, as most people who are interested sign the
 form and pay the $600 for the disks.  I've had a number of conversations
 about how I view this to be a significant barrier to open research, esp. in
 under-served countries and to open source.  People sympathize with me, but
 then move on.

 To this day, I think the only way it will happen is for the community to
 build a completely open system, perhaps based off of Common Crawl or our
 own crawl and host it ourselves and develop judgments, etc.  We tried to
 get this off the ground w/ the Open Relevance Project, but there was never
 a sustainable effort, and thus I have little hope at this point for it (but
 I would love to be proven wrong)  For it to succeed, I think we would need
 the backing of a University with students interested in curating such a
 collection, the judgments, etc.  I think we could figure out how to
 distribute the data either

Can we use TREC data set in open source?

2013-09-09 Thread Han Jiang
Back in 2007 Grant contacted NIST about making the TREC collection
available to our community:

http://mail-archives.apache.org/mod_mbox/lucene-dev/200708.mbox/browser

I think a try for this is really important to our project and people who
use Lucene. All these years the speed performance is mainly tuned on
Wikipedia, however it's not very 'standard':

* it doesn't represent how real-world search works;
* it cannot be used to evaluate the relevance of our scoring models;
* researchers tend to do experiments on other data sets, and usually it is
  hard to know whether Lucene delivers its best performance;

And personally I agree with this line:

 I think it would encourage Lucene users/developers to think about
 relevance as much as we think about speed.

There's been much work to make Lucene's scoring models pluggable in 4.0,
and it'll be great if we can explore more about it. It is very appealing to
see a high-performance library work along with state-of-the-art ranking
methods.


And about TREC data set, the problems we met are:

1. NIST/TREC does not own the original collections, therefore it might be
   necessary to have direct contact with those organizations that do own
   them, such as:

   http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html
   http://lemurproject.org/clueweb12/

2. Currently, there is no open-source license for any of the data sets, so
   it won't be as 'open' as Wikipedia is.

   As is proposed by Grant, a possibility is to make the data set accessible
   only to committers instead of all users. It is not very open-source then,
   but TREC data sets are public and usually available to researchers, so
   people can still reproduce performance tests.

I'm quite curious, has anyone explored getting an open-source license for
one of those data sets? And is our community still interested about this
issue after all these years?



-- 
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China


[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-09-06 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13760160#comment-13760160
 ] 

Han Jiang commented on LUCENE-3069:
---

Mike, thanks for the review!

bq. In general, couldn't the writer re-use the reader's TermState?

I'm afraid this somewhat makes the code longer? I'll make a patch to check.

{quote}
Have you run "first do no harm" perf tests? Ie, compare current trunk
w/ default Codec to branch w/ default Codec? Just to make sure there
are no surprises...
{quote}

Yes, no surprise yet.

bq. Why does Lucene41PostingsWriter have impersonation code? 

Yeah, these should be removed.

{quote}
I forget: why does the postings reader/writer need to handle delta
coding again (take an absolute boolean argument)? Was it because of
pulsing or sep? It's fine for now (progress not perfection) ... but
not clean, since delta coding is really an encoding detail so in
theory the terms dict should own that ...
{quote}

Ah, yes, because of pulsing.

This is because... PulsingPostingsBase is more than a PostingsBaseFormat. 
It somewhat acts like a term dict, e.g. it needs to understand how terms are 
structured in one block (term No.1 uses an absolute value, term No.x uses a 
delta value), and then judge how to restructure the inlined and wrapped 
blocks (No.1 still uses an absolute value, but the first non-pulsed term 
will need absolute encoding as well).

Without the argument 'absolute', the real term dictionary would do the delta 
encoding itself, then PulsingPostingsBase would be confused, and all wrapped 
PostingsBases would have to encode metadata values without delta-format.
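The absolute-versus-delta distinction can be sketched like this (hedged: `MetadataDeltaSketch` and `encodeBlock` are hypothetical stand-ins, not Lucene's API). The first term in a block carries an absolute metadata value and the following terms carry only deltas from the previous term, which is exactly the information the 'absolute' flag conveys:

```java
import java.util.Arrays;

public class MetadataDeltaSketch {
    // Delta-encode a block of monotonic metadata values (e.g. file
    // pointers). The first term in the block has no predecessor, so it
    // is written absolute; later terms store only the delta.
    static long[] encodeBlock(long[] filePointers) {
        long[] out = new long[filePointers.length];
        long last = 0;
        for (int i = 0; i < filePointers.length; i++) {
            boolean absolute = (i == 0);   // block start: no previous value
            out[i] = absolute ? filePointers[i] : filePointers[i] - last;
            last = filePointers[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // pointers grow monotonically, so deltas stay small (cheap to vint-encode)
        System.out.println(Arrays.toString(encodeBlock(new long[]{1000, 1040, 1100})));
        // [1000, 40, 60]
    }
}
```

A wrapping format like Pulsing needs to know which positions were written absolute, since decoding a delta without its base value is meaningless.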



{quote}
The new .smy file for Pulsing is sort of strange ... but necessary
since it always uses 0 longs, so we have to store this somewhere
... you could put it into FieldInfo attributes instead?
{quote}

Yeah, it is another hairy thing... the reason is, we don't have a 
'PostingsTrailer' for PostingsBaseFormat. Pulsing will not know the longs 
size for each field until all the fields are consumed... and it should not 
write those longsSize values to termsOut in close(), since the term 
dictionary will use the DirTrailer hack here. (Maybe every term dictionary 
should close the postingsWriter first, then write the field summary and 
close itself? I'm not sure though.)


bq. Should we backport this to 4.x? 

Yeah, OK!

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 5.0, 4.5

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.




[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-09-06 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13760325#comment-13760325
 ] 

Han Jiang commented on LUCENE-3069:
---

I think this is ready to commit to trunk now, and I'll wait for a day or two 
before committing it. :)

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 5.0, 4.5

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.




[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-09-04 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13757676#comment-13757676
 ] 

Han Jiang commented on LUCENE-3069:
---

OK! These two term dicts are both FST-based:

* FST term dict directly uses the FST to map a term to its metadata & stats 
  (FSTTermData)
* FSTOrd term dict uses the FST to map a term to its ordinal number 
  (FSTLong), and the ordinal is then used to seek metadata from another big 
  chunk.

I prefer the second impl since it puts much less stress on FST.
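The ordinal indirection behind the second design can be sketched as below. This is a hedged toy: a sorted map stands in for the real FST, and the class and method names (`FSTOrdSketch`, `seek`) are made up for illustration. The FST maps a term only to a single long (its ordinal), which keeps the FST small; the metadata lives in a separate chunk addressed by that ordinal:

```java
import java.util.*;

public class FSTOrdSketch {
    // The "FST": term -> ordinal, a single long output per term.
    final TreeMap<String, Long> termToOrd = new TreeMap<>();
    // The metadata chunk, indexed by ordinal.
    final List<long[]> metadataByOrd = new ArrayList<>();

    // Terms are added in sorted order, mirroring how an FST is built.
    void add(String term, long[] metadata) {
        termToOrd.put(term, (long) metadataByOrd.size());
        metadataByOrd.add(metadata);
    }

    long[] seek(String term) {
        Long ord = termToOrd.get(term);            // step 1: term -> ord
        return ord == null ? null : metadataByOrd.get(ord.intValue()); // step 2
    }

    public static void main(String[] args) {
        FSTOrdSketch dict = new FSTOrdSketch();
        dict.add("apache", new long[]{3, 12});  // e.g. {docFreq, filePointer}
        dict.add("lucene", new long[]{1, 40});
        System.out.println(Arrays.toString(dict.seek("lucene"))); // [1, 40]
    }
}
```

Because ordinals are dense and assigned in term order, this layout also makes seek-by-ord natural, while the first design would need FST.getByOutput for that.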

I have updated the detailed format explanation in the last commit. Hmm, I'll 
create another patch for this...

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 5.0, 4.5

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.




[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-09-04 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13757771#comment-13757771
 ] 

Han Jiang commented on LUCENE-3069:
---

Yes, with slight changes, it can support seek by ord. (With FST.getByOutput). 

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 5.0, 4.5

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.




[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-09-04 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

Patch from last commit, and summary:

Previously our term dictionaries were both block-based: 

* BlockTerms dict breaks terms list into several blocks, as a linear 
  structure with skip points. 

* BlockTreeTerms dict uses a trie-like structure to decide how terms are 
  assigned to different blocks, and uses an FST index to optimize seeking 
  performance.

However, those two kinds of term dictionary don't hold all the term 
data in memory. In the worst case there are at least two seeks:
one from the index in memory, another from the file on disk. And we already 
have many complicated optimizations for this...

If by design a term dictionary can be memory resident, the data structure 
will be simpler (after all, we don't need to maintain extra file pointers 
for a second seek, and we don't have to decide heuristics for how terms 
are clustered). And this is why these two FST-based implementations are 
introduced.

Another big change in the code: since our term dictionaries were both 
block-based, the previous API was also limited. It was the postings writer 
that collected term metadata, and the term dictionary that told the postings 
writer the range of terms it should flush to a block. However, the encoding 
of term data should be decided by the term dictionary, since the postings 
writer doesn't always know how terms are structured in the term dictionary...
The previous API had some tricky code for this, e.g. PulsingPostingsWriter 
had to use a term's ordinal in its block to decide how to write metadata, 
which is unnecessary.

To make the API between term dict and postings list more 'pluggable' and 
'general', I refactored the PostingsReader/WriterBase. For example, the 
postings writer should provide some information to the term dictionary, like 
how many metadata values are strictly monotonic, so that the term dictionary 
can optimize delta-encoding itself. And since the term dictionary now fully 
decides how metadata are written, it gets the ability to utilize 
int-block-based metadata encoding.
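The contract described in the paragraph above, where the postings writer declares how many of its per-term metadata longs are strictly monotonic and the term dictionary delta-encodes exactly that prefix, might look like this toy (hedged: `MonotonicMetadataSketch` and `deltaEncode` are hypothetical names, not the real API):

```java
import java.util.Arrays;

public class MonotonicMetadataSketch {
    // The postings writer declares that the first numMonotonic longs of
    // each term's metadata only ever grow (e.g. file pointers), so the
    // term dictionary can store small deltas for them and must leave the
    // remaining values as-is.
    static long[] deltaEncode(long[] prev, long[] current, int numMonotonic) {
        long[] out = current.clone();
        for (int i = 0; i < numMonotonic; i++) {
            out[i] = current[i] - prev[i]; // monotonic prefix: small positive delta
        }
        return out; // non-monotonic tail is written raw
    }

    public static void main(String[] args) {
        long[] prev = {1000, 7};  // e.g. {filePointer (monotonic), skipOffset}
        long[] cur  = {1060, 9};
        System.out.println(Arrays.toString(deltaEncode(prev, cur, 1))); // [60, 9]
    }
}
```

The key design point is that the writer only declares the shape of its metadata; the dictionary owns the encoding decision, which is what makes the two parts pluggable.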

Now the two implementations of term dictionary can easily be plugged into 
current postings formats, like:
* FST41 = FSTTermdict + Lucene41PostingsBaseFormat
* FSTOrd41 = FSTOrdTermdict + Lucene41PostingsBaseFormat
* FSTOrdPulsing41 = FSTOrdTermsdict + PulsingPostingsWrapper + Lucene41PostingsFormat

About performance, as shown before, those two term dicts improve primary-key 
lookup, but still have overhead on wildcard queries (both term dicts keep 
only prefix information, and the term dictionary cannot work well with 
this...). I'll try to hack on this later.

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 5.0, 4.5

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.




[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-09-03 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

The uploaded patch should show all the changes against trunk: I added two 
different implementations of the term dict, and refactored the 
PostingsBaseFormat to plug in non-block-based term dicts.

I'm still working on the javadocs, and maybe we should rename that 'temp' 
package to something like 'fstterms'?



 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 5.0, 4.5

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.




[jira] [Commented] (LUCENE-5199) Improve LuceneTestCase.defaultCodecSupportsDocsWithField to check the actual DocValuesFormat used per-field

2013-09-03 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13756670#comment-13756670
 ] 

Han Jiang commented on LUCENE-5199:
---

Thanks Shai!

 Improve LuceneTestCase.defaultCodecSupportsDocsWithField to check the actual 
 DocValuesFormat used per-field
 ---

 Key: LUCENE-5199
 URL: https://issues.apache.org/jira/browse/LUCENE-5199
 Project: Lucene - Core
  Issue Type: Improvement
  Components: general/test
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 5.0, 4.5

 Attachments: LUENE-5199.patch


 On LUCENE-5178 Han reported the following test failure:
 {noformat}
 [junit4] FAILURE 0.27s | TestRangeAccumulator.testMissingValues 
[junit4] Throwable #1: org.junit.ComparisonFailure: expected:...(0)
[junit4]   less than 10 ([8)
[junit4]   less than or equal to 10 (]8)
[junit4]   over 90 (8)
[junit4]   9... but was:...(0)
[junit4]   less than 10 ([28)
[junit4]   less than or equal to 10 (2]8)
[junit4]   over 90 (8)
[junit4]   9...
[junit4]  at 
 __randomizedtesting.SeedInfo.seed([815B6AA86D05329C:EBC638EE498F066D]:0)
[junit4]  at 
 org.apache.lucene.facet.range.TestRangeAccumulator.testMissingValues(TestRangeAccumulator.java:670)
[junit4]  at java.lang.Thread.run(Thread.java:722)
 {noformat}
 which can be reproduced with
 {noformat}
 ant test -Dtestcase=TestRangeAccumulator -Dtests.method=testMissingValues 
 -Dtests.seed=815B6AA86D05329C -Dtests.slow=true 
 -Dtests.postingsformat=Lucene41 -Dtests.locale=ca 
 -Dtests.timezone=Australia/Currie -Dtests.file.encoding=UTF-8
 {noformat}
 It seems that the Codec that is picked is a Lucene45Codec with 
 Lucene42DVFormat, which does not support docsWithFields for numericDV. We 
 should improve LTC.defaultCodecSupportsDocsWithField to take a list of fields 
 and check that the actual DVF used for each field supports it.




[jira] [Commented] (LUCENE-5199) Improve LuceneTestCase.defaultCodecSupportsDocsWithField to check the actual DocValuesFormat used per-field

2013-09-03 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13756766#comment-13756766
 ] 

Han Jiang commented on LUCENE-5199:
---

Thanks Rob! Yeah, I just hit another failure around TestSortDocValues. :)

 Improve LuceneTestCase.defaultCodecSupportsDocsWithField to check the actual 
 DocValuesFormat used per-field
 ---

 Key: LUCENE-5199
 URL: https://issues.apache.org/jira/browse/LUCENE-5199
 Project: Lucene - Core
  Issue Type: Improvement
  Components: general/test
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 5.0, 4.5

 Attachments: LUCENE-5199.patch, LUENE-5199.patch


 On LUCENE-5178 Han reported the following test failure:
 {noformat}
 [junit4] FAILURE 0.27s | TestRangeAccumulator.testMissingValues 
[junit4] Throwable #1: org.junit.ComparisonFailure: expected:...(0)
[junit4]   less than 10 ([8)
[junit4]   less than or equal to 10 (]8)
[junit4]   over 90 (8)
[junit4]   9... but was:...(0)
[junit4]   less than 10 ([28)
[junit4]   less than or equal to 10 (2]8)
[junit4]   over 90 (8)
[junit4]   9...
[junit4]  at 
 __randomizedtesting.SeedInfo.seed([815B6AA86D05329C:EBC638EE498F066D]:0)
[junit4]  at 
 org.apache.lucene.facet.range.TestRangeAccumulator.testMissingValues(TestRangeAccumulator.java:670)
[junit4]  at java.lang.Thread.run(Thread.java:722)
 {noformat}
 which can be reproduced with
 {noformat}
 ant test -Dtestcase=TestRangeAccumulator -Dtests.method=testMissingValues 
 -Dtests.seed=815B6AA86D05329C -Dtests.slow=true 
 -Dtests.postingsformat=Lucene41 -Dtests.locale=ca 
 -Dtests.timezone=Australia/Currie -Dtests.file.encoding=UTF-8
 {noformat}
 It seems that the Codec that is picked is a Lucene45Codec with 
 Lucene42DVFormat, which does not support docsWithFields for numericDV. We 
 should improve LTC.defaultCodecSupportsDocsWithField to take a list of fields 
 and check that the actual DVF used for each field supports it.




[jira] [Commented] (LUCENE-5178) doc values should expose missing values (or allow configurable defaults)

2013-09-02 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13756313#comment-13756313
 ] 

Han Jiang commented on LUCENE-5178:
---

During testing I somehow hit a failure:

{noformat}
   [junit4] FAILURE 0.27s | TestRangeAccumulator.testMissingValues 
   [junit4] Throwable #1: org.junit.ComparisonFailure: expected:...(0)
   [junit4]   less than 10 ([8)
   [junit4]   less than or equal to 10 (]8)
   [junit4]   over 90 (8)
   [junit4]   9... but was:...(0)
   [junit4]   less than 10 ([28)
   [junit4]   less than or equal to 10 (2]8)
   [junit4]   over 90 (8)
   [junit4]   9...
   [junit4]at 
__randomizedtesting.SeedInfo.seed([815B6AA86D05329C:EBC638EE498F066D]:0)
   [junit4]at 
org.apache.lucene.facet.range.TestRangeAccumulator.testMissingValues(TestRangeAccumulator.java:670)
   [junit4]at java.lang.Thread.run(Thread.java:722)
{noformat}

Seed:
{noformat}
ant test  -Dtestcase=TestRangeAccumulator -Dtests.method=testMissingValues 
-Dtests.seed=815B6AA86D05329C -Dtests.slow=true -Dtests.postingsformat=Lucene41 
-Dtests.locale=ca -Dtests.timezone=Australia/Currie -Dtests.file.encoding=UTF-8
{noformat}

 doc values should expose missing values (or allow configurable defaults)
 

 Key: LUCENE-5178
 URL: https://issues.apache.org/jira/browse/LUCENE-5178
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Yonik Seeley
 Fix For: 5.0, 4.5

 Attachments: LUCENE-5178.patch, LUCENE-5178_reintegrate.patch


 DocValues should somehow allow a configurable default per-field.
 Possible implementations include setting it on the field in the document or 
 registration of an IndexWriter callback.
 If we don't make the default configurable, then another option is to have 
 DocValues fields keep track of whether a value was indexed for that document 
 or not.




[jira] [Commented] (LUCENE-5194) TestBackwardsCompatibility should not test Pulsing41

2013-08-29 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13754307#comment-13754307
 ] 

Han Jiang commented on LUCENE-5194:
---

Thanks Mike!

 TestBackwardsCompatibility should not test Pulsing41
 

 Key: LUCENE-5194
 URL: https://issues.apache.org/jira/browse/LUCENE-5194
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Michael McCandless
 Fix For: 5.0, 4.5


 Spinoff from LUCENE-3069, where Billy discovered this ...
 For some reason it's currently testing a Pulsing41 index (at least 
 index.41.cfs.zip), but we do not guarantee back compat for PulsingPF.




[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-08-28 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

Patch, to show the impersonation hack for the Pulsing format. 

We cannot perfectly impersonate the old pulsing format yet: the old format 
divided the metadata block into inlined bytes and wrapped bytes, so when the 
term dict reader reads the length of the metadata block, it is actually the 
length of the 'inlined block'... And the 'wrapped block' won't be loaded for 
the wrapped PF.

However, introducing a new method in PostingsReaderBase doesn't seem to be a 
good way...

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 5.0, 4.5

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch


 FST-based TermDictionary has been a great improvement, yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST-based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (via a custom fst.Output) and builds an FST from the entire 
 term, not just the delta.




[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-08-23 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

Patch showing how the current codecs (Block/BlockTree + 
Lucene4X/Pulsing/Mock*) change under our API refactoring. 
TestBackwardsCompatibility still fails; I'll work on the impersonation 
later.




[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-08-23 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13748582#comment-13748582
 ] 

Han Jiang commented on LUCENE-3069:
---

bq. Patch looks great on quick look! I'll look more when I'm back
bq. online...

OK! I'll commit it so that we can see the later changes.

bq. One thing: I think e.g. BlockTreeTermsReader needs some back-compat
bq. code, so it won't try to read longsSize on old indices?

Yes, both Block* term dicts will get a new VERSION constant to mark the
change, and if the codec header shows a previous version, they will not read
the longsSize VInt.
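A minimal sketch of that version gate, with illustrative constants and method names (not Lucene's actual ones): only a new-enough index carries the longsSize field, so the reader consumes it conditionally.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical sketch: a term dict reader bumps its format VERSION when
// longsSize is added, and only reads the extra field when the codec header
// says the index was written by a new-enough version.
public class VersionGatedTermDict {
    static final int VERSION_START = 0;
    static final int VERSION_META_ARRAY = 1; // version that introduced longsSize

    static int readLongsSize(DataInputStream in, int indexVersion) throws IOException {
        if (indexVersion >= VERSION_META_ARRAY) {
            return in.readInt(); // new index: longsSize is present
        }
        return 0; // old index: field is absent, metadata stays opaque bytes
    }

    public static void main(String[] args) throws IOException {
        byte[] header = {0, 0, 0, 3}; // big-endian int 3
        int newIdx = readLongsSize(new DataInputStream(new ByteArrayInputStream(header)), VERSION_META_ARRAY);
        int oldIdx = readLongsSize(new DataInputStream(new ByteArrayInputStream(header)), VERSION_START);
        System.out.println(newIdx + " " + oldIdx); // 3 0
    }
}
```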




[jira] [Commented] (LUCENE-5179) Refactoring on PostingsWriterBase for delta-encoding

2013-08-17 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13742902#comment-13742902
 ] 

Han Jiang commented on LUCENE-5179:
---

Thanks! I'll commit.

 Refactoring on PostingsWriterBase for delta-encoding
 

 Key: LUCENE-5179
 URL: https://issues.apache.org/jira/browse/LUCENE-5179
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Han Jiang
Assignee: Han Jiang
 Fix For: 5.0, 4.5

 Attachments: LUCENE-5179.patch


 A further step from LUCENE-5029.
 The short story is that the previous API change brings two problems:
 * it somewhat breaks backward compatibility: although we can still read the
   old format, we can no longer reproduce it;
 * the pulsing codec has a problem with it.
 And the long story...
 With the change, the current PostingsBase API works like this:
 * the term dict tells the PBF we start a new term (via startTerm());
 * the PBF adds docs, positions and other postings data;
 * the term dict tells the PBF that all the data for the current term is
   complete (via finishTerm()), then the PBF returns the metadata for the
   current term (as long[] and byte[]);
 * the term dict might buffer all the metadata in an ArrayList; when every
   term is collected, it then decides how that metadata will be laid out on
   disk.
 So after the API change the PBF no longer has that annoying 'flushTermBlock';
 instead the term dict maintains the (term, metadata) list.
 However, for each term we'll now write the long[] blob before the byte[], so
 the index format is not consistent with pre-4.5: in Lucene41 the metadata
 could be written as (longA, bytesA, longB), but now we have to write it as
 (longA, longB, bytesA).
 Another problem is that the pulsing codec cannot tell the wrapped PBF how the
 metadata is delta-encoded; after all, PulsingPostingsWriter is only a PBF.
 For example, suppose we have terms=[a, a1, a2, b, b1, b2] and itemsInBlock=2;
 theoretically we'll finally have three blocks in BTTR: [a b] [a1 a2] [b1 b2].
 With this approach the metadata of term b is delta-encoded against the
 metadata of term a, but when the term dict tells the PBF to finishTerm(b), it
 might naively delta-encode against term a2.
 So I think maybe we can introduce a method 'encodeTerm(long[], DataOutput
 out, FieldInfo, TermState, boolean absolute)', so that during the metadata
 flush we can control how the current term is written. The term dict will
 buffer TermState, which implicitly holds the metadata, just as we do on the
 PBReader side.
 For example, if we want to reproduce the old Lucene41 format, we can simply
 set longsSize==0; then the PBF writes the old format (longA, bytesA, longB)
 to the DataOutput, and the compatibility issue is solved.
 For the pulsing codec, it will also be able to tell the lower level how to
 encode the metadata.
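The proposed contract can be sketched roughly like this; everything here (class name, method shape, the two metadata slots) is illustrative rather than Lucene's actual API: the first term of a block is encoded with absolute values, and each later term as deltas against the previous term.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an encodeTerm(long[], ..., boolean absolute)
// contract: 'absolute' marks the first term of a block; later terms in the
// same block are written as deltas against the previous term's metadata.
public class EncodeTermSketch {

    // Encode one term's long[] metadata; 'last' holds the previous term's
    // absolute values and is updated in place so the next call can delta it.
    static void encodeTerm(long[] metadata, long[] last, boolean absolute, List<Long> out) {
        for (int i = 0; i < metadata.length; i++) {
            out.add(absolute ? metadata[i] : metadata[i] - last[i]);
            last[i] = metadata[i];
        }
    }

    public static void main(String[] args) {
        List<Long> out = new ArrayList<>();
        long[] last = new long[2]; // e.g. {docStartFP, posStartFP} (illustrative)
        encodeTerm(new long[] {100, 7}, last, true,  out); // first term: absolute
        encodeTerm(new long[] {130, 9}, last, false, out); // deltas vs. previous term
        encodeTerm(new long[] {160, 9}, last, false, out);
        System.out.println(out); // [100, 7, 30, 2, 30, 0]
    }
}
```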




[jira] [Updated] (LUCENE-5179) Refactoring on PostingsWriterBase for delta-encoding

2013-08-17 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-5179:
--

Issue Type: Sub-task  (was: Improvement)
Parent: LUCENE-3069




[jira] [Closed] (LUCENE-5179) Refactoring on PostingsWriterBase for delta-encoding

2013-08-17 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang closed LUCENE-5179.
-





[jira] [Resolved] (LUCENE-5179) Refactoring on PostingsWriterBase for delta-encoding

2013-08-17 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang resolved LUCENE-5179.
---

   Resolution: Fixed
Lucene Fields: New,Patch Available  (was: New)




[jira] [Created] (LUCENE-5179) Refactoring on PostingsWriterBase for delta-encoding

2013-08-16 Thread Han Jiang (JIRA)
Han Jiang created LUCENE-5179:
-

 Summary: Refactoring on PostingsWriterBase for delta-encoding
 Key: LUCENE-5179
 URL: https://issues.apache.org/jira/browse/LUCENE-5179
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Han Jiang
Assignee: Han Jiang
 Fix For: 5.0, 4.5





[jira] [Updated] (LUCENE-5179) Refactoring on PostingsWriterBase for delta-encoding

2013-08-16 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-5179:
--

Attachment: LUCENE-5179.patch

Patch for branch3069; tests pass for all 'temp' postings formats.




[jira] [Commented] (LUCENE-5179) Refactoring on PostingsWriterBase for delta-encoding

2013-08-16 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13742787#comment-13742787
 ] 

Han Jiang commented on LUCENE-5179:
---

bq. Is it for real back compat or for impersonation ?
bq. Real back-compat (reader can read the old index format using the new APIs) 
should work fine, I think?

Yes, this should be 'impersonation', but actually the back-compat I mentioned 
is a weak requirement.
I'm not happy with this revert either, so let's see if we can do something to 
hack it! :)

The strong requirement is: if we need pulsing to work with the new API, there 
should be something that tells pulsing how to encode each term.

Ideally pulsing should tell the term dict longsSize=0, while maintaining the 
wrapped PF's longsSize.

The calling chain is:

{noformat}
 termdict --finishTermA(long[0], byte[]...)--> pulsing --finishTermB(long[3], byte[]...)--> wrappedPF
{noformat}

Take the terms=[a, a1, ...] example: when term b is finished,

the wrapped PF fills its long[] and byte[] with its metadata, and pulsing 
instead fills only the byte[]
as its 'fake' metadata. When a term is not inlined, pulsing has to encode the 
wrapped PF's long[] into byte[],
but that's too early! Term b should be delta-encoded against term a, and 
pulsing will never know this...

If we only need pulsing to work, there is a trade-off: pulsing returns the 
wrapped PF's longsSize,
and the term dict does the buffering. But for Lucene41Pulsing with positions+payloads, 
we'd have to write 3 zero VLongs,
along with the pulsing byte[], for an inlined term... and it's not actually 
'pulsing' then.
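The chain above might look roughly like the following sketch (all names hypothetical; a structural illustration, not Lucene's PostingsWriterBase): pulsing advertises longsSize=0 and returns only opaque bytes, while the wrapped writer it delegates to advertises longsSize=3.

```java
// Hypothetical sketch of the pulsing delegation chain. The term dict only
// sees the pulsing writer, which claims longsSize = 0 and hands back opaque
// bytes; for non-inlined terms it must delegate to the wrapped writer, whose
// real metadata is 3 longs worth. Names are illustrative, not Lucene's API.
public class PulsingChainSketch {

    interface PostingsWriter {
        int longsSize();                 // how many long metadata slots this writer uses
        byte[] finishTerm(int docFreq);  // metadata bytes for the just-finished term
    }

    // "Wrapped" writer: real metadata lives in 3 longs, serialized as bytes here.
    static class WrappedWriter implements PostingsWriter {
        public int longsSize() { return 3; }
        public byte[] finishTerm(int docFreq) {
            return new byte[] {1, 2, 3}; // stand-in for the encoded long[3]
        }
    }

    // Pulsing writer: inlines postings for low-freq terms, else delegates.
    static class PulsingWriter implements PostingsWriter {
        final PostingsWriter wrapped = new WrappedWriter();
        final int maxInlined = 1;
        public int longsSize() { return 0; } // term dict sees no long[] at all
        public byte[] finishTerm(int docFreq) {
            if (docFreq <= maxInlined) {
                return new byte[] {42};  // inlined postings as opaque bytes
            }
            // Too early to delta-encode the wrapped long[] here: the term dict
            // decides block boundaries, and pulsing never sees them.
            return wrapped.finishTerm(docFreq);
        }
    }

    public static void main(String[] args) {
        PulsingWriter pulsing = new PulsingWriter();
        System.out.println(pulsing.longsSize());          // 0
        System.out.println(pulsing.finishTerm(1).length); // 1: inlined
        System.out.println(pulsing.finishTerm(5).length); // 3: delegated
    }
}
```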








[jira] [Commented] (LUCENE-5179) Refactoring on PostingsWriterBase for delta-encoding

2013-08-16 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13742792#comment-13742792
 ] 

Han Jiang commented on LUCENE-5179:
---

By the way, Mike, I think this change doesn't preclude the Simple9/16 encoding 
you mentioned.

You can have a look at the changed TempFSTTermsWriter: here we always pass 
'true' to encodeTerm, 
so the PBF will not do any delta encoding; instead the FST takes that 
responsibility. 

When we need to block-encode the long[] for a whole term block, the term dict 
can simply buffer all the 
long[] returned by encodeTerm(...,true), then use the compression algorithm.

Whether to do the VLong encoding is decided by the term dict, not the PBF. 
'encodeTerm' only performs the 'delta-encode',
and provides the PBF the chance to know how the current term is flushed along 
with other terms.
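That buffering idea can be sketched as follows, with a trivial frame-of-reference scheme standing in for whatever block compression (e.g. Simple9/16) the term dict might choose; names are illustrative, not Lucene's actual API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: the term dict asks for absolute metadata
// (encodeTerm(..., absolute=true)), buffers the long[] of a whole block, and
// only then applies a block compression of its own choosing. Here the
// stand-in scheme subtracts the block minimum (frame of reference).
public class BlockEncodeSketch {

    // Encode one metadata slot across all buffered terms of a block:
    // output is [min, term0 - min, term1 - min, ...].
    static long[] blockEncode(List<long[]> buffered, int slot) {
        long min = Long.MAX_VALUE;
        for (long[] term : buffered) min = Math.min(min, term[slot]);
        long[] out = new long[buffered.size() + 1];
        out[0] = min; // the frame of reference
        for (int i = 0; i < buffered.size(); i++) {
            out[i + 1] = buffered.get(i)[slot] - min;
        }
        return out;
    }

    public static void main(String[] args) {
        List<long[]> buffered = new ArrayList<>();
        buffered.add(new long[] {100}); // absolute metadata per term,
        buffered.add(new long[] {130}); // as returned by encodeTerm(..., true)
        buffered.add(new long[] {160});
        long[] encoded = blockEncode(buffered, 0);
        System.out.println(Arrays.toString(encoded)); // [100, 0, 30, 60]
    }
}
```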




[jira] [Comment Edited] (LUCENE-5179) Refactoring on PostingsWriterBase for delta-encoding

2013-08-16 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13742792#comment-13742792
 ] 

Han Jiang edited comment on LUCENE-5179 at 8/17/13 2:05 AM:


By the way, Mike, I think this change doesn't preclude the Simple9/16 encoding
you mentioned.

You can have a look at the changed TempFSTTermsWriter: here we always pass
'true' to encodeTerm, so PBF will not do any delta encoding. Instead the FST
takes that responsibility.

When we need to block-encode the long[] for a whole term block, the term dict
can simply buffer all the long[] returned by encodeTerm(..., true), then apply
the compression algorithm.

Whether to do VLong/delta encoding is still decided by the term dict, not PBF.
'encodeTerm' only performs the 'delta-encode', and provides PBF the chance to
know how the current term is flushed along with other terms.

  was (Author: billy):
By the way, Mike, I think this change doesn't preclude the Simple9/16
encoding you mentioned.

You can have a look at the changed TempFSTTermsWriter: here we always pass
'true' to encodeTerm, so PBF will not do any delta encoding. Instead the FST
takes that responsibility.

When we need to block-encode the long[] for a whole term block, the term dict
can simply buffer all the long[] returned by encodeTerm(..., true), then apply
the compression algorithm.

Whether to do VLong encoding is decided by the term dict, not PBF. 'encodeTerm'
only performs the 'delta-encode', and provides PBF the chance to know how the
current term is flushed along with other terms.
  
 Refactoring on PostingsWriterBase for delta-encoding
 

 Key: LUCENE-5179
 URL: https://issues.apache.org/jira/browse/LUCENE-5179
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Han Jiang
Assignee: Han Jiang
 Fix For: 5.0, 4.5

 Attachments: LUCENE-5179.patch


 A further step from LUCENE-5029.
 The short story is, the previous API change brings two problems:
 * it somewhat breaks backward compatibility: although we can still read the old format,
   we can no longer reproduce it;
 * the pulsing codec has a problem with it.
 And the long story...
 With the change, the current PostingsBase API will be like this:
 * term dict tells PBF we start a new term (via startTerm());
 * PBF adds docs, positions and other postings data;
 * term dict tells PBF all the data for the current term is completed (via finishTerm()),
   then PBF returns the metadata for the current term (as long[] and byte[]);
 * term dict might buffer all the metadata in an ArrayList; when all the terms are collected,
   it then decides how those metadata will be laid out on disk.
 So after the API change, PBF no longer has that annoying 'flushTermBlock'; instead
 the term dict maintains the <term, metadata> list.
 However, for each term we'll now write the long[] blob before the byte[], so the
 index format is not consistent with pre-4.5:
 in Lucene41 the metadata could be written as longA,bytesA,longB, but now
 we have to write it as longA,longB,bytesA.
 Another problem is, the pulsing codec cannot tell the wrapped PBF how the metadata is
 delta-encoded; after all, PulsingPostingsWriter is only a PBF.
 For example, suppose we have terms=[a, a1, a2, b, b1, b2] and itemsInBlock=2, so theoretically
 we'll finally have three blocks in BTTR: [a b] [a1 a2] [b1 b2]. With this
 approach, the metadata of term b is delta-encoded based on the metadata of a,
 but when the term dict tells
 PBF to finishTerm(b), it might naively delta-encode based on term a2.
 So I think maybe we can introduce a method 'encodeTerm(long[], DataOutput
 out, FieldInfo, TermState, boolean absolute)',
 so that during metadata flush, we can control how the current term is written.
 The term dict will then buffer TermState, which
 implicitly holds metadata like we do on the PBReader side.
 For example, if we want to reproduce the old Lucene41 format, we can simply set
 longsSize==0; then PBF
 writes the old format (longA,bytesA,longB) to DataOutput, and the compatibility
 issue is solved.
 For the pulsing codec, it will also be able to tell the lower level how to encode
 metadata.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-08-15 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

Patch: update the BlockTerms dict so that it follows the refactored API.

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 5.0, 4.5

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-08-13 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738105#comment-13738105
 ] 

Han Jiang commented on LUCENE-3069:
---

Hi, currently we have a problem when migrating the code to trunk:

The API refactoring on PostingsReader/WriterBase now splits the term metadata into
two parts: a monotonic long[] and a generic byte[]; the former is known to the term
dictionary for better d-gap encoding.

So we need a 'longsSize' in the field summary, to tell the reader the fixed length of
this monotonic long[]. However, this API change actually breaks backward
compatibility: the old 4.x indices didn't support this, and for some codecs like
Lucene40, since their writer parts are already deprecated, their tests won't pass.

It seems like we could put all the metadata in the generic byte[] and let PBF do its
own buffering (like we do in the old API: nextTerm()), but then we'd have to add
that logic in every PBF.

So... can we solve this problem more elegantly?
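As a rough sketch of the split (hypothetical names, not the actual Lucene classes): the term dict can d-gap the fixed-width long[] itself, while the byte[] stays opaque to it.

```java
// Minimal sketch of the long[]/byte[] metadata split (illustrative only):
// the term dictionary sees a fixed-width monotonic long[] per term that it
// can d-gap encode across terms, plus an opaque byte[] only the PBF decodes.
public class TermMetadataSketch {
    final long[] longs;  // monotonic per-term values, length == longsSize
    final byte[] bytes;  // opaque PBF payload

    TermMetadataSketch(long[] longs, byte[] bytes) {
        this.longs = longs;
        this.bytes = bytes;
    }

    // the term dict can delta-encode the monotonic part against the previous term
    static long[] dGap(long[] prev, long[] cur) {
        long[] out = new long[cur.length];
        for (int i = 0; i < cur.length; i++) {
            out[i] = cur[i] - (prev == null ? 0 : prev[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        long[] a = {100, 7};   // e.g. two file-pointer fields for one term
        long[] b = {150, 9};   // same fields for the next term
        long[] gap = dGap(a, b);
        System.out.println(gap[0] + "," + gap[1]); // prints 50,2
    }
}
```

With longsSize==0 the long[] part vanishes and everything rides in the opaque byte[], which is why that setting can reproduce the old on-disk layout.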

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 5.0, 4.5

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-08-13 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

Patch with backward compatibility fix on Lucene41PBF (TempPostingsReader is 
actually a fork of Lucene41PostingsReader).

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 5.0, 4.5

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-08-02 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

Uploaded patch.

It is optimized for wildcard queries, and I did a quick test on 1M wiki data:
{noformat}
Task          QPS base  StdDev      QPS comp  StdDev      Pct diff
PKLookup        314.63  (1.5%)       314.64  (1.2%)       0.0% (  -2% -    2%)
Fuzzy1           91.32  (3.7%)        92.50  (1.6%)       1.3% (  -3% -    6%)
Respell         104.54  (3.9%)       106.97  (1.6%)       2.3% (  -2% -    8%)
Fuzzy2           38.22  (4.1%)        39.16  (1.2%)       2.5% (  -2% -    8%)
Wildcard        109.56  (3.1%)       273.42  (5.0%)     149.6% ( 137% -  162%)
{noformat}

and TempFSTOrd vs. Lucene41, on 1M data:
{noformat}
Task          QPS base  StdDev      QPS comp  StdDev      Pct diff
Respell         134.85  (3.7%)       106.30  (0.6%)     -21.2% ( -24% -  -17%)
Fuzzy2           47.78  (4.1%)        39.03  (0.9%)     -18.3% ( -22% -  -13%)
Fuzzy1          112.02  (3.0%)        91.95  (0.6%)     -17.9% ( -20% -  -14%)
Wildcard        326.68  (3.5%)       273.41  (1.9%)     -16.3% ( -20% -  -11%)
PKLookup        194.61  (1.8%)       314.24  (0.7%)      61.5% (  57% -   65%)
{noformat}

But I'm not happy with it :(. The hack I did here is to consume another big
block to store the last byte of each term. So for a wildcard query ab*c, we have
external information to tell the ord of the nearest term matching *c. Knowing the ord,
we can use an approach similar to getByOutput to jump to the next target term.

Previously, we had to walk the FST to the stop node to find out whether the
last byte is 'c', so this optimization turns out to be a big win.

However, I don't really like this patch :(: we have to increase the index size (521M
=> 530M), and the code becomes messy, since we always have to foresee the
next arc on the current stack.

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 5.0, 4.5

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: VInt block lenght in Lucene 4.1 postings format

2013-08-01 Thread Han Jiang
Hi Aleksandra,

The PostingsReader uses a skip list to determine the start file
pointer of each block (both FOR-packed and vInt-encoded). This
information is currently maintained by Lucene41SkipReader.

The tricky part is, for each term, the skip data is exactly at the end
of the TermFreqs blocks. So, if you fetch the startFP for the vInt block, and
know the docTermStartOffset & skipOffset for the current term, you can
calculate what you need.

http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html#Frequencies
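A sketch of that arithmetic (the field names here are illustrative, not the exact Lucene41 ones): since the skip data sits right after the last TermFreqs block, the end of the vInt tail is the term's doc start FP plus the skip offset, and the skip list supplies the tail's start FP.

```java
// Hypothetical sketch: derive the byte length of the vInt tail block from
// the term's doc start file pointer, its skip offset (skip data begins right
// after the last TermFreqs block), and the tail's start FP from the skip list.
public class VIntBlockLength {
    static long vIntBlockBytes(long docTermStartFP, long skipOffset, long vIntBlockStartFP) {
        // end of the vInt block == where the skip data starts
        return (docTermStartFP + skipOffset) - vIntBlockStartFP;
    }

    public static void main(String[] args) {
        // toy numbers: postings for the term start at FP 1000, skip data at
        // offset 260 within them, and the skip list says the vInt tail starts at FP 1200
        System.out.println(vIntBlockBytes(1000, 260, 1200)); // prints 60
    }
}
```

With those 60 bytes known, the whole tail block could be copied verbatim without any vInt decode/re-encode, which is what the block-copy idea above needs.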

On Thu, Aug 1, 2013 at 4:20 PM, Aleksandra Woźniak
aleksandra.k.wozn...@gmail.com wrote:
 Hi all,

 recently I wanted to try out some modifications of Lucene's postings format
 (namely, copying blocks that have no deletions without int-decoding/encoding
 -- this is similar to what was described here:
 https://issues.apache.org/jira/browse/LUCENE-2082). I started with changing
 Lucene 4.1 postings format to check what can be done there.

 I came across the following problem: in Lucene41PostingsReader the length
 (number of bytes) of the last, vInt-encoded, block of postings is not known
 before all individual postings are read and decoded. When reading this block
 we only know the number of postings that should be read and decoded -- since
 vInts have different sizes by definition.

 If I wanted to copy the whole block without vInt decoding/encoding, I need
 to know how many bytes I have to read from postings index input. So, my
 question is: is there a clean way to determine the length of this block (ie.
 the number of bytes that this block has)? Is the number of bytes in a
 posting list tracked somewhere in Lucene 4.1 postings format?

 Thanks,
 Aleksandra



-- 
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-31 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13724955#comment-13724955
 ] 

Han Jiang commented on LUCENE-3069:
---

Performance result after last patch(intersect) is applied.

On wiki 33M data, between TempFST(with intersect) and TempFSTOrd(with 
intersect):
{noformat}
Task          QPS base  StdDev      QPS comp  StdDev      Pct diff
PKLookup        232.47  (1.0%)       205.28  (2.0%)     -11.7% ( -14% -   -8%)
Prefix3          26.93  (1.2%)        28.40  (1.4%)       5.5% (   2% -    8%)
Wildcard          6.75  (2.1%)         7.37  (1.5%)       9.2% (   5% -   13%)
Fuzzy1           29.86  (1.8%)        51.87  (3.7%)      73.7% (  67% -   80%)
Fuzzy2           30.82  (1.6%)        53.82  (2.7%)      74.7% (  69% -   80%)
Respell          27.30  (1.2%)        49.55  (2.6%)      81.5% (  76% -   86%)
{noformat}

So the decoding of outputs is really the main hurt.

And now we should start to compare it with trunk (base=Lucene41, 
comp=TempFSTOrd):
Hmm, I must have done something wrong with the wildcard query here.

{noformat}
Task          QPS base  StdDev      QPS comp  StdDev      Pct diff
Wildcard         19.21  (2.1%)         7.30  (0.3%)     -62.0% ( -63% -  -60%)
Prefix3          33.69  (1.2%)        28.18  (0.9%)     -16.4% ( -18% -  -14%)
Fuzzy1           61.59  (2.1%)        52.36  (0.8%)     -15.0% ( -17% -  -12%)
Fuzzy2           60.94  (1.0%)        54.15  (1.3%)     -11.1% ( -13% -   -8%)
Respell          54.21  (2.8%)        49.54  (1.2%)      -8.6% ( -12% -   -4%)
PKLookup        148.40  (1.0%)       208.07  (3.6%)      40.2% (  35% -   45%)
{noformat}

I'll commit current version so we can iterate on it.

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 5.0, 4.5

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-31 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13725288#comment-13725288
 ] 

Han Jiang commented on LUCENE-3069:
---

bq. Maybe try testing on a different wildcard query, e.g. something like a*b* 
(that does not have a commonSuffix)?

I replaced all the ab*c in the tasks file with ab*c*, but the performance hit is 
still heavy:

33M wikidata, Lucene41 vs. TempFSTOrd
{noformat}
Wildcard          7.40  (1.9%)         4.63  (1.2%)     -37.5% ( -39% -  -34%)
{noformat} 

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 5.0, 4.5

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Welcome Cassandra Targett as Lucene/Solr committer

2013-07-31 Thread Han Jiang
Welcome Cassandra!

On Thu, Aug 1, 2013 at 6:47 AM, Robert Muir rcm...@gmail.com wrote:
 I'm pleased to announce that Cassandra Targett has accepted to join our
 ranks as a committer.

 Cassandra worked on the donation of the new Solr Reference Guide [1] and
 getting things in order for its first official release [2].
 Cassandra, it is tradition that you introduce yourself with a brief bio.

 Welcome!

 P.S. As soon as your SVN access is setup, you should then be able to add
 yourself to the committers list on the website as well.

 [1]
 https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide
 [2] https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/




-- 
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5152) Lucene FST is not immutable

2013-07-30 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13723922#comment-13723922
 ] 

Han Jiang commented on LUCENE-5152:
---

bq. So its really just a BytesRef bug right? 
+1, so tricky

 Lucene FST is not immutable
 --

 Key: LUCENE-5152
 URL: https://issues.apache.org/jira/browse/LUCENE-5152
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/FSTs
Affects Versions: 4.4
Reporter: Simon Willnauer
Priority: Blocker
 Fix For: 5.0, 4.5

 Attachments: LUCENE-5152.patch


 a spinoff from LUCENE-5120, where the analyzing suggester modified a returned 
 output from an FST (BytesRef), which caused side effects in later execution. 
 I added an assertion into the FST that checks if a cached root arc is 
 modified, and in fact this happens for instance in our MemoryPostingsFormat, 
 and I bet we'll find more places. We need to think about how to make this less 
 trappy since it can cause bugs that are super hard to find.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-30 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-5152.patch

The previous design put much stress on decoding of Outputs.
This becomes a disaster for wildcard queries: e.g. for f*nd,
we usually have to walk the FST to the last character, then
find that it is not 'd' and the automaton doesn't accept it.
In this case, TempFST actually iterates over all the results
of f*, decoding all the metadata for them...

So I'm trying another approach; the main idea is to load
metadata & stats as lazily as possible.
Here I use FST<Long> as the term index, and leave all other stuff
in a single term block. The term index FST holds the relationship
between term and ord, and in the term block we can maintain a skip list
to find the related metadata & stats.

It is a little similar to BTTR now, and we can someday control how much
data to keep memory resident (e.g. keep stats in memory but metadata on
disk; however, this should be another issue).
Another good part is, it naturally supports seek by ord (ah,
actually I don't understand where that is used).

Tests pass, and intersect is not implemented yet.
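A rough sketch of that layout (all names here are invented, and a TreeMap stands in for the FST term index): the index only maps term to ord, and a per-block skip array locates the ord's stats and metadata inside the single on-disk term block.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Rough sketch of the "FST as term index" idea (names invented): the index
// only maps term -> ord; metadata & stats stay in one on-disk term block,
// located via a skip array and decoded lazily from there.
public class OrdIndexSketch {
    final TreeMap<String, Long> termToOrd = new TreeMap<>(); // stand-in for the FST
    final long[] blockStartFP; // file pointer of each metadata block
    final int blockSize;       // terms per block

    OrdIndexSketch(List<String> terms, long[] blockStartFP, int blockSize) {
        long ord = 0;
        for (String t : terms) {
            termToOrd.put(t, ord++);
        }
        this.blockStartFP = blockStartFP;
        this.blockSize = blockSize;
    }

    // find the block holding this term's metadata; real decoding would start here
    long blockFPFor(String term) {
        long ord = termToOrd.get(term);
        return blockStartFP[(int) (ord / blockSize)];
    }

    public static void main(String[] args) {
        OrdIndexSketch idx = new OrdIndexSketch(
            Arrays.asList("a", "a1", "a2", "b", "b1", "b2"),
            new long[]{0, 128, 256}, 2);
        System.out.println(idx.blockFPFor("b")); // ord 3 -> block 1 -> FP 128
    }
}
```

Seek-by-ord also falls out of this shape: ord / blockSize gives the block directly, with no term lookup at all.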
perf based on 1M wiki data, between non-intersect TempFST and TempFSTOrd:

{noformat}
Task          QPS base  StdDev      QPS comp  StdDev      Pct diff
PKLookup        373.80  (0.0%)       320.30  (0.0%)     -14.3% ( -14% -  -14%)
Fuzzy1           43.82  (0.0%)        47.10  (0.0%)       7.5% (   7% -    7%)
Prefix3         399.62  (0.0%)       433.95  (0.0%)       8.6% (   8% -    8%)
Fuzzy2           14.26  (0.0%)        15.95  (0.0%)      11.9% (  11% -   11%)
Respell          40.69  (0.0%)        46.29  (0.0%)      13.8% (  13% -   13%)
Wildcard         83.44  (0.0%)        96.54  (0.0%)      15.7% (  15% -   15%)
{noformat}

The perf hit on PKLookup should be sane, since I haven't optimized the skip list yet.

I'll update intersect() later, and later we'll cut over to 
PagedBytes & PackedLongBuffer.


 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 5.0, 4.5

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-5152.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-30 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 5.0, 4.5

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-30 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: (was: LUCENE-5152.patch)

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 5.0, 4.5

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-30 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

Patch, revive IntersectTermsEnum in TempFSTOrd.

Mike, since we already have an intersect() impl, maybe we can still keep this? 
By the way, it is easy to migrate from TempFST to TempFSTOrd.

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 5.0, 4.5

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-5138) Update source file attributes

2013-07-26 Thread Han Jiang (JIRA)
Han Jiang created LUCENE-5138:
-

 Summary: Update source file attributes
 Key: LUCENE-5138
 URL: https://issues.apache.org/jira/browse/LUCENE-5138
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Han Jiang
Priority: Minor
 Fix For: 5.0, 4.5


Currently we have many java files with the executable attribute set, 
while some scripts that generate source files are missing it.

Maybe we should clean this up?




[jira] [Updated] (LUCENE-5138) Update source file attributes

2013-07-26 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-5138:
--

Attachment: LUCENE-5138.patch

Patch, created by:

{noformat}
find . -type f -executable -name '*.java' -exec svn propdel svn:executable {} \;
{noformat}

Since our builder is going to regenerate
the source files soon, maybe it is OK to leave
the executable bit missing on those scripts?

 Update source file attributes
 -

 Key: LUCENE-5138
 URL: https://issues.apache.org/jira/browse/LUCENE-5138
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Han Jiang
Priority: Minor
 Fix For: 5.0, 4.5

 Attachments: LUCENE-5138.patch


 Currently we have many Java files with the executable attribute set, 
 while some scripts that generate source files are missing it.
 Maybe we should clean this up?




[jira] [Closed] (LUCENE-5138) Update source file attributes

2013-07-26 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang closed LUCENE-5138.
-

Resolution: Fixed

 Update source file attributes
 -

 Key: LUCENE-5138
 URL: https://issues.apache.org/jira/browse/LUCENE-5138
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Han Jiang
Priority: Minor
 Fix For: 5.0, 4.5

 Attachments: LUCENE-5138.patch


 Currently we have many Java files with the executable attribute set, 
 while some scripts that generate source files are missing it.
 Maybe we should clean this up?




[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-23 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

Upload patch: implemented IntersectEnum.next() & seekCeil().
Lots of nocommits, but it passes all tests.

The main idea is to run a DFS on the FST, and backtrack as early as
possible (i.e. as soon as we see that a label is rejected by the automaton).

For this version, there is one explicit perf overhead: I use a 
real stack here, which could be replaced by a Frame[] to reuse objects.

There are several aspects I didn't dig deep into: 

* currently, CompiledAutomaton provides a commonSuffixRef, but how
  can we make use of it in the FST?
* the DFS is somewhat a 'goto' version, i.e. we could make the code 
  cleaner with a single while-loop similar to a BFS search. 
  However, since the FST doesn't always tell us how many arcs are leaving 
  the current node, we have a problem dealing with this...
* when the FST is large enough, the next() operation will take much time
  doing the linear arc read; maybe we should make use of 
  CompiledAutomaton.sortedTransition[] when the leaving arcs are heavy.
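The backtracking DFS described above can be sketched on a plain trie standing in for the FST. This is only an illustration, not Lucene's implementation: the Node/build/step/accept names are hypothetical, and a toy automaton accepting a[b]* stands in for CompiledAutomaton. The key point is the early backtrack: a subtree is skipped as soon as its arc label is rejected.

```java
import java.util.*;

public class TrieIntersect {
    // Toy trie node standing in for an FST node; TreeMap keeps arcs sorted.
    static class Node {
        TreeMap<Character, Node> arcs = new TreeMap<>();
        boolean isTerm;
    }

    static Node build(String... terms) {
        Node root = new Node();
        for (String t : terms) {
            Node n = root;
            for (char c : t.toCharArray()) {
                n = n.arcs.computeIfAbsent(c, k -> new Node());
            }
            n.isTerm = true;
        }
        return root;
    }

    // Toy automaton transition: -1 means the label is rejected.
    // Accepts strings matching a[b]* (state 0 -> 1 on 'a', 1 -> 1 on 'b').
    static int step(int state, char label) {
        if (state == 0 && label == 'a') return 1;
        if (state == 1 && label == 'b') return 1;
        return -1;
    }

    static boolean accept(int state) { return state == 1; }

    // DFS over the trie; backtrack as early as possible, i.e. the moment
    // the automaton rejects an arc label we skip that whole subtree.
    static void dfs(Node n, int state, StringBuilder prefix, List<String> out) {
        if (n.isTerm && accept(state)) out.add(prefix.toString());
        for (Map.Entry<Character, Node> e : n.arcs.entrySet()) {
            int next = step(state, e.getKey());
            if (next < 0) continue;               // early backtrack
            prefix.append(e.getKey());
            dfs(e.getValue(), next, prefix, out);
            prefix.setLength(prefix.length() - 1);
        }
    }

    public static void main(String[] args) {
        Node root = build("a", "ab", "abb", "ac", "b");
        List<String> hits = new ArrayList<>();
        dfs(root, 0, new StringBuilder(), hits);
        System.out.println(hits); // prints [a, ab, abb]
    }
}
```

An explicit Frame[] stack (as mentioned above) would replace the recursion here to allow object reuse.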


 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.




[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-23 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.




[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-23 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13717911#comment-13717911
 ] 

Han Jiang commented on LUCENE-3069:
---

bq. You should not need to .getPosition / .setPosition on the fstReader:

Oh, yes! I'll fix.

bq. I think we can't really make use of it, which is fine (it's an optional 
optimization).

OK, actually I was quite curious why we don't make use of commonPrefixRef 
in CompiledAutomaton. Maybe we can determinize the input Automaton first, then
get commonPrefixRef via SpecialOperation? Is it too slow, or is the prefix not
always long enough to be worth taking into consideration?

bq. But this can only be done if that FST node's arcs are array'd right?

Yes, array arcs only, and we might need methods like advance(label) to do the 
search; here a gossip search might work better than a traditional binary search.
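As a rough sketch of what such an advance(label) could look like over array'd arcs (the method name and the plain sorted char[] of labels are assumptions for illustration; a gossip/interpolation search would replace the midpoint choice with a guess based on the label distribution):

```java
public class ArcAdvance {
    // Binary search over a node's sorted arc labels instead of a linear
    // scan. Returns the index of the exact label if present, otherwise the
    // index of the first label >= target (possibly labels.length), which is
    // where an automaton intersection would resume.
    static int advance(char[] labels, char target) {
        int lo = 0, hi = labels.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (labels[mid] < target) lo = mid + 1;
            else if (labels[mid] > target) hi = mid - 1;
            else return mid;          // exact label found
        }
        return lo;                    // first label >= target
    }

    public static void main(String[] args) {
        char[] labels = {'a', 'c', 'f', 'k'};
        System.out.println(advance(labels, 'c')); // prints 1 (exact hit)
        System.out.println(advance(labels, 'd')); // prints 2 (ceil position)
    }
}
```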

{quote}
Separately, supporting ord w/ FST terms dict should in theory be not
so hard; you'd need to use getByOutput to seek by ord. Maybe (later,
eventually) we can make this a write-time option. We should open a
separate issue ...
{quote}

Ah, yes, but it seems that getByOutput doesn't rewind/reuse previous state?
We always have to start from the first arc during every seek. However, I'm 
not sure in what kind of use case we need the ord information.


I'll commit the current version first, so we can iterate.

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 5.0, 4.5

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.




Re: for those of you using gmail...

2013-07-17 Thread Han Jiang
On Wed, Jul 17, 2013 at 10:26 PM, Michael McCandless
luc...@mikemccandless.com wrote:
 Can you try this search in your gmail:

 from:jenk...@thetaphi.de regression build 6605

 And let me know if you get 1 or 0 results back?


Yes, 0 results here; 1 result when 'regression' is removed.

And it seems that no results are returned for the query:

  from:jenk...@thetaphi.de subject:build 6605  ANY_WORD_NOT_IN_TITLE

Maybe for some mails, only the title field is taken into consideration?


 I get 0 results back but I should get 1, I think.

 Furthermore, if I search for:

 from:jenk...@thetaphi.de regression

 I only get results up to Jul 2, even though there are many build
 failures after that.

A recent search got results up to #6530. Still no 6605.



-- 
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China




[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-16 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

Patch: revert hashCode()

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
 LUCENE-3069.patch, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.




[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-15 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13708486#comment-13708486
 ] 

Han Jiang commented on LUCENE-3069:
---

bq. I think we should assert that the seekCeil returned SeekStatus.FOUND?

Ok! I'll commit that.

bq. useCache is an ancient option from back when we had a terms dict cache

Yes, I suppose it is not 'clear' to have this parameter.

bq. seekExact is working as it should I think.

Currently, I think those 'seek' methods are supposed to change the enum pointer 
based on the input term string, and fetch related metadata from the term dict. 

However, seekExact(BytesRef, TermState) simply 'copies' the value of termState 
to the enum, which doesn't actually operate a 'seek' on the dictionary. 

bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh, yes, I once thought about this, but I'm not sure: can the callee always 
make sure that, when 'term()' is called, it will always return a valid term?
The code in MemoryPF just returns 'pair.output' regardless of whether pair==null; 
is it safe?

bq. TempMetaData.hashCode() doesn't mix in docFreq/tTF?

Oops! thanks, nice catch!


 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.




[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-15 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13708486#comment-13708486
 ] 

Han Jiang edited comment on LUCENE-3069 at 7/15/13 2:20 PM:


bq. I think we should assert that the seekCeil returned SeekStatus.FOUND?

Ok! I'll commit that.

bq. useCache is an ancient option from back when we had a terms dict cache

Yes, I suppose it is not 'clear' to have this parameter.

bq. seekExact is working as it should I think.

Currently, I think those 'seek' methods are supposed to change the enum pointer 
based on the input term string, and fetch related metadata from the term dict. 

However, seekExact(BytesRef, TermState) simply 'copies' the value of termState 
to the enum, which doesn't actually operate a 'seek' on the dictionary. 

bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh, yes, I once thought about this, but I'm not sure: can the callee always 
make sure that, when 'term()' is called, it will always return a valid term?
The code in MemoryPF just returns 'pair.output' regardless of whether pair==null, 
is it safe?

bq. TempMetaData.hashCode() doesn't mix in docFreq/tTF?

Oops! thanks, nice catch!

bq. It doesn't impl equals (must it really impl hashCode?)

Hmm, do we need equals? Also, NodeHash relies on hashCode to judge whether two 
nodes can be 'merged'.

  was (Author: billy):
bq. I think we should assert that the seekCeil returned SeekStatus.FOUND?

Ok! I'll commit that.

bq. useCache is an ancient option from back when we had a terms dict cache

Yes, I suppose is is not 'clear' to have this parameter.

bq. seekExact is working as it should I think.

Currently, I think those 'seek' methods are supposed to change the enum pointer 
based on
input term string, and fetch related metadata from term dict. 

However, seekExact(BytesRef, TermsState) simply 'copy' the value of termState 
to enum, which 
doesn't actually operate 'seek' on dictionary. 

bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh, yes, I once thought about this, but not sure: like, can the callee always 
makes sure that,
when 'term()' is called, it will always return a valid term?
The codes in MemoryPF just return 'pair.output' regardless whether pair==null, 
is it safe?

bq. TempMetaData.hashCode() doesn't mix in docFreq/tTF?

Oops! thanks, nice catch!

  
 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.




[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-15 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13708486#comment-13708486
 ] 

Han Jiang edited comment on LUCENE-3069 at 7/15/13 2:20 PM:


bq. I think we should assert that the seekCeil returned SeekStatus.FOUND?

Ok! I'll commit that.

bq. useCache is an ancient option from back when we had a terms dict cache

Yes, I suppose it is not 'clear' to have this parameter.

bq. seekExact is working as it should I think.

Currently, I think those 'seek' methods are supposed to change the enum pointer 
based on the input term string, and fetch related metadata from the term dict. 

However, seekExact(BytesRef, TermState) simply 'copies' the value of termState 
to the enum, which doesn't actually operate a 'seek' on the dictionary. 

bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh, yes, I once thought about this, but I'm not sure: can the callee always 
make sure that, when 'term()' is called, it will always return a valid term?
The code in MemoryPF just returns 'pair.output' regardless of whether pair==null, 
is it safe?

bq. TempMetaData.hashCode() doesn't mix in docFreq/tTF?

Oops! thanks, nice catch!

bq. It doesn't impl equals (must it really impl hashCode?)

Hmm, do we need equals? Also, NodeHash relies on hashCode to judge whether two 
fst nodes can be 'merged'.

  was (Author: billy):
bq. I think we should assert that the seekCeil returned SeekStatus.FOUND?

Ok! I'll commit that.

bq. useCache is an ancient option from back when we had a terms dict cache

Yes, I suppose is is not 'clear' to have this parameter.

bq. seekExact is working as it should I think.

Currently, I think those 'seek' methods are supposed to change the enum pointer 
based on
input term string, and fetch related metadata from term dict. 

However, seekExact(BytesRef, TermsState) simply 'copy' the value of termState 
to enum, which 
doesn't actually operate 'seek' on dictionary. 

bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh, yes, I once thought about this, but not sure: like, can the callee always 
makes sure that,
when 'term()' is called, it will always return a valid term?
The codes in MemoryPF just return 'pair.output' regardless whether pair==null, 
is it safe?

bq. TempMetaData.hashCode() doesn't mix in docFreq/tTF?

Oops! thanks, nice catch!

bq. It doesn't impl equals (must it really impl hashCode?)

Hmm, do we need equals? Also, NodeHash relys on hashCode to judge whether to 
nodes can be 'merged'.
  
 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.




[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-15 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13708486#comment-13708486
 ] 

Han Jiang edited comment on LUCENE-3069 at 7/15/13 2:35 PM:


bq. I think we should assert that the seekCeil returned SeekStatus.FOUND?

Ok! I'll commit that.

bq. useCache is an ancient option from back when we had a terms dict cache

Yes, I suppose it is not 'clear' to have this parameter.

bq. seekExact is working as it should I think.

Currently, I think those 'seek' methods are supposed to change the enum pointer 
based on the input term string, and fetch related metadata from the term dict. 

However, seekExact(BytesRef, TermState) simply 'copies' the value of termState 
to the enum, which doesn't actually operate a 'seek' on the dictionary. 

bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh, yes, I once thought about this, but I'm not sure: can the callee always 
make sure that, when 'term()' is called, it will always return a valid term?
The code in MemoryPF just returns 'pair.output' regardless of whether pair==null, 
is it safe?

bq. TempMetaData.hashCode() doesn't mix in docFreq/tTF?

Oops! thanks, nice catch!

bq. It doesn't impl equals (must it really impl hashCode?)

-Hmm, do we need equals? Also, NodeHash relies on hashCode to judge whether two 
fst nodes can be 'merged'.-
Oops, I forgot it still relies on equals to make sure two instances really 
match; ok, I'll add that.

  was (Author: billy):
bq. I think we should assert that the seekCeil returned SeekStatus.FOUND?

Ok! I'll commit that.

bq. useCache is an ancient option from back when we had a terms dict cache

Yes, I suppose is is not 'clear' to have this parameter.

bq. seekExact is working as it should I think.

Currently, I think those 'seek' methods are supposed to change the enum pointer 
based on
input term string, and fetch related metadata from term dict. 

However, seekExact(BytesRef, TermsState) simply 'copy' the value of termState 
to enum, which 
doesn't actually operate 'seek' on dictionary. 

bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh, yes, I once thought about this, but not sure: like, can the callee always 
makes sure that,
when 'term()' is called, it will always return a valid term?
The codes in MemoryPF just return 'pair.output' regardless whether pair==null, 
is it safe?

bq. TempMetaData.hashCode() doesn't mix in docFreq/tTF?

Oops! thanks, nice catch!

bq. It doesn't impl equals (must it really impl hashCode?)

Hmm, do we need equals? Also, NodeHash relys on hashCode to judge whether two 
fst nodes can be 'merged'.
  
 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.




[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-15 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13708486#comment-13708486
 ] 

Han Jiang edited comment on LUCENE-3069 at 7/15/13 4:09 PM:


bq. I think we should assert that the seekCeil returned SeekStatus.FOUND?

Ok! I'll commit that.

bq. useCache is an ancient option from back when we had a terms dict cache

Yes, I suppose it is not 'clear' to have this parameter.

bq. seekExact is working as it should I think.

Currently, I think those 'seek' methods are supposed to change the enum pointer 
based on the input term string, and fetch related metadata from the term dict. 

However, seekExact(BytesRef, TermState) simply 'copies' the value of termState 
to the enum, which doesn't actually operate a 'seek' on the dictionary. 

bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh, yes, I once thought about this, but I'm not sure: can the callee always 
make sure that, when 'term()' is called, it will always return a valid term?
The code in MemoryPF just returns 'pair.output' regardless of whether pair==null, 
is it safe?

bq. TempMetaData.hashCode() doesn't mix in docFreq/tTF?

Oops! thanks, nice catch!

bq. It doesn't impl equals (must it really impl hashCode?)

-Hmm, do we need equals? Also, NodeHash relies on hashCode to judge whether two 
fst nodes can be 'merged'.-
Oops, I forgot it still relies on equals to make sure two instances really 
match; ok, I'll add that.

By the way, for real data, when two outputs are not 'NO_OUTPUT', even if they 
contain the same metadata + stats, it seems very seldom that their arcs can be 
identical on the FST (it increases by less than 1MB for wikimedium1m if equals 
always returns false for non-singleton arguments). Therefore... yes, 
hashCode() isn't necessary here.

  was (Author: billy):
bq. I think we should assert that the seekCeil returned SeekStatus.FOUND?

Ok! I'll commit that.

bq. useCache is an ancient option from back when we had a terms dict cache

Yes, I suppose is is not 'clear' to have this parameter.

bq. seekExact is working as it should I think.

Currently, I think those 'seek' methods are supposed to change the enum pointer 
based on
input term string, and fetch related metadata from term dict. 

However, seekExact(BytesRef, TermsState) simply 'copy' the value of termState 
to enum, which 
doesn't actually operate 'seek' on dictionary. 

bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh, yes, I once thought about this, but not sure: like, can the callee always 
makes sure that,
when 'term()' is called, it will always return a valid term?
The codes in MemoryPF just return 'pair.output' regardless whether pair==null, 
is it safe?

bq. TempMetaData.hashCode() doesn't mix in docFreq/tTF?

Oops! thanks, nice catch!

bq. It doesn't impl equals (must it really impl hashCode?)

-Hmm, do we need equals? Also, NodeHash relys on hashCode to judge whether two 
fst nodes can be 'merged'.-
Oops, I forgot it still relys on equals to make sure two instance really 
matches, ok, I'll add that.
  
 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.




[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-15 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

Patch according to previous comments.

We still somewhat need the existence of
hashCode(), because NodeHash will 
check whether the frozen node has the same 
hashcode as the uncompiled node (NodeHash:128).
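As a minimal sketch of the contract being discussed (the TermMeta class and its fields are hypothetical, standing in for an FST output such as TempMetaData): hashCode lets NodeHash bucket candidate nodes cheaply, while equals is what confirms a real match before two nodes are shared.

```java
import java.util.Objects;

// Hypothetical FST output carrying per-term stats. hashCode and equals must
// agree: otherwise node-sharing would either miss identical nodes (bad
// hashCode) or merge nodes that are not actually equal (missing equals).
final class TermMeta {
    final long docFreq;
    final long totalTermFreq;

    TermMeta(long docFreq, long totalTermFreq) {
        this.docFreq = docFreq;
        this.totalTermFreq = totalTermFreq;
    }

    @Override
    public int hashCode() {
        // mix in both stats, so equal metadata always hashes equally
        return Objects.hash(docFreq, totalTermFreq);
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof TermMeta)) return false;
        TermMeta t = (TermMeta) other;
        return docFreq == t.docFreq && totalTermFreq == t.totalTermFreq;
    }
}
```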

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
 LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.




[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-15 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13708638#comment-13708638
 ] 

Han Jiang edited comment on LUCENE-3069 at 7/15/13 5:08 PM:


Patch according to previous comments.

We still somewhat need hashCode() to exist, because 
NodeHash checks whether the frozen node has the same 
hash code as the uncompiled node (NodeHash.java:128).

Although later, for nodes with outputs, it will hardly 
ever find an identical node in the hash table.

  was (Author: billy):
Patch according to previous comments.

We still somewhat need the existance of
hashCode(), because in NodeHash, it will 
check whether the frozen node have the same 
hashcode with uncompiled node (NodeHash:128).
  
 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
 LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.




[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-13 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707709#comment-13707709
 ] 

Han Jiang commented on LUCENE-3069:
---

I did a checkindex on wikimediumall.trunk.Lucene41.nd33.3326M:

Here is the bit width summary for body field:

||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656 | 48860170| 43532656|
| 2| 10328824 | 13979539| 16200377|
| 3| 2682453 | 5032450| 6532755|
| 4| 836109 | 2471794| 3134437|
| 5| 262696 | 1324704| 1718862|
| 6| 86487 | 755797| 990563|
| 7| 29276 | 442974| 571996|
| 8| 11257 | 263874| 339382|
| 9| 4627 | 161402| 205662|
|10| 2060 | 102198| 128034|
|11| 979 | 63955| 79531|
|12| 386 | 39377| 48805|
|13| 170 | 24321| 30113|
|14| 65 | 14686| 18437|
|15| 10 | 9055| 10918|
|16| 2 | 5229| 6821|
|17| 0 | 2669| 3595|
|18| 0 | 1312| 1897|
|19| 0 | 696| 914|
|20| 0 | 209| 509|
|21| 0 | 44| 148|
|22| 0 | 4| 38|
|23| 0 | 0| 8|
|24| 0 | 0| 1|
|25| 0 | 0| 0|
|26| 0 | 0| 0|
|27| 0 | 0| 0|
|28| 0 | 0| 0|
|29| 0 | 0| 0|
|30| 0 | 0| 0|
|31| 0 | 0| 0|
|32| 0 | 0| 0|

So 66.4% of terms have df==1, and 78.5% have df==ttf.
Considering the different bit sizes, the df+ttf encoding 
saves 57.3MB out of 148.7MB in total, using the following estimation:

{noformat}
old_size = col[2] * vIntByteSize(rownumber)   + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * 
vIntByteSize(rownumber)
{noformat}

By the way, I am quite tempted to omit frq blocks in Lucene41PostingsReader.
When we know that df==ttf, we can be sure that every in-doc freq==1. So for 
example, when the bit width ranges from 2 to 8 (inclusive), df is not large 
enough to create ForBlocks, so we have to VInt-encode each in-doc freq. For 
this 'body' field, I think we can reduce the index size by about 67.5MB (here 
I only consider the vInt blocks, since the 1-bit ForBlock is usually small).

For all the fields in wikimediumall, we can save 60.8MB out of 245.2MB (for 
df+ttf only), while the vInt frq blocks we could omit from the PBF amount to 
about 95.8MB, I suppose.

I'll test this later.
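The estimation above can be reproduced with a short script (a sketch in Python; the vIntByteSize definition and the table rows come from this comment, the rest is assumed):

```python
# Sketch of the df+ttf size estimation above.  vIntByteSize(b) is the
# number of bytes Lucene's vInt encoding needs for a value of bit width
# b: 7 payload bits per byte, so ceil(b / 7).

def vint_bytes(bit_width):
    return (bit_width + 6) // 7

# (bit, #(df==ttf), #df, #ttf) -- the first rows of the 'body' table above.
rows = [
    (1, 43532656, 48860170, 43532656),
    (2, 10328824, 13979539, 16200377),
    (3,  2682453,  5032450,  6532755),
    (4,   836109,  2471794,  3134437),
]

old_size = new_size = 0
for bit, eq, df, ttf in rows:
    # old: df and ttf are each written as a plain vInt
    old_size += df * vint_bytes(bit) + ttf * vint_bytes(bit)
    # new: steal one bit from df to flag df==ttf, and write ttf only
    # for the terms where it differs from df
    new_size += df * vint_bytes(bit + 1) + (ttf - eq) * vint_bytes(bit)

saved = old_size - new_size  # positive: the new encoding is smaller
```

Since the bit-1 row (df==1, and almost always df==ttf) dominates the term population, the saving is driven almost entirely by long-tail terms.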

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: example.png, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.




[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-13 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707709#comment-13707709
 ] 

Han Jiang edited comment on LUCENE-3069 at 7/13/13 10:04 AM:
-

I did a checkindex on wikimediumall.trunk.Lucene41.nd33.3326M:

Here is the bit width summary for body field:

||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656 | 48860170| 43532656|
| 2| 10328824 | 13979539| 16200377|
| 3| 2682453 | 5032450| 6532755|
| 4| 836109 | 2471794| 3134437|
| 5| 262696 | 1324704| 1718862|
| 6| 86487 | 755797| 990563|
| 7| 29276 | 442974| 571996|
| 8| 11257 | 263874| 339382|
| 9| 4627 | 161402| 205662|
|10| 2060 | 102198| 128034|
|11| 979 | 63955| 79531|
|12| 386 | 39377| 48805|
|13| 170 | 24321| 30113|
|14| 65 | 14686| 18437|
|15| 10 | 9055| 10918|
|16| 2 | 5229| 6821|
|17| 0 | 2669| 3595|
|18| 0 | 1312| 1897|
|19| 0 | 696| 914|
|20| 0 | 209| 509|
|21| 0 | 44| 148|
|22| 0 | 4| 38|
|23| 0 | 0| 8|
|24| 0 | 0| 1|
|25| 0 | 0| 0|
|26| 0 | 0| 0|
|27| 0 | 0| 0|
|28| 0 | 0| 0|
|29| 0 | 0| 0|
|30| 0 | 0| 0|
|31| 0 | 0| 0|
|32| 0 | 0| 0|
|...|0|0|0|
|tot|57778057|73556459|73556459|

So 66.4% of terms have df==1, and 78.5% have df==ttf.
Considering the different bit sizes, the df+ttf encoding 
saves 57.3MB out of 148.7MB in total, using the following estimation:

{noformat}
old_size = col[2] * vIntByteSize(rownumber)   + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * 
vIntByteSize(rownumber)
{noformat}

By the way, I am quite tempted to omit frq blocks in Lucene41PostingsReader.
When we know that df==ttf, we can be sure that every in-doc freq==1. So for 
example, when the bit width ranges from 2 to 8 (inclusive), df is not large 
enough to create ForBlocks, so we have to VInt-encode each in-doc freq. For 
this 'body' field, I think we can reduce the index size by about 67.5MB (here 
I only consider the vInt blocks, since the 1-bit ForBlock is usually small).

For all the fields in wikimediumall, we can save 60.8MB out of 245.2MB (for 
df+ttf only), while the vInt frq blocks we could omit from the PBF amount to 
about 95.8MB, I suppose.

I'll test this later.

  was (Author: billy):
I did a checkindex on wikimediumall.trunk.Lucene41.nd33.3326M:

Here is the bit width summary for body field:

||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656 | 48860170| 43532656|
| 2| 10328824 | 13979539| 16200377|
| 3| 2682453 | 5032450| 6532755|
| 4| 836109 | 2471794| 3134437|
| 5| 262696 | 1324704| 1718862|
| 6| 86487 | 755797| 990563|
| 7| 29276 | 442974| 571996|
| 8| 11257 | 263874| 339382|
| 9| 4627 | 161402| 205662|
|10| 2060 | 102198| 128034|
|11| 979 | 63955| 79531|
|12| 386 | 39377| 48805|
|13| 170 | 24321| 30113|
|14| 65 | 14686| 18437|
|15| 10 | 9055| 10918|
|16| 2 | 5229| 6821|
|17| 0 | 2669| 3595|
|18| 0 | 1312| 1897|
|19| 0 | 696| 914|
|20| 0 | 209| 509|
|21| 0 | 44| 148|
|22| 0 | 4| 38|
|23| 0 | 0| 8|
|24| 0 | 0| 1|
|25| 0 | 0| 0|
|26| 0 | 0| 0|
|27| 0 | 0| 0|
|28| 0 | 0| 0|
|29| 0 | 0| 0|
|30| 0 | 0| 0|
|31| 0 | 0| 0|
|32| 0 | 0| 0|

So we have 66.4% docFreq with df==1, and 78.5% with df==ttf.
Considering different bit size, for df+ttf encoding, 
totally it saves 57.3MB from 148.7MB, using following estimation:

{noformat}
old_size = col[2] * vIntByteSize(rownumber)   + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * 
vIntByteSize(rownumber)
{noformat}

By the way, I am quite lured to omit frq blocks in Luene41PostingsReader.
When we know that df==ttf, we can always make sure the in-doc frq==1. So for 
example, 
when bit width ranges from 2 to 8(inclusive), since df is not large enough to 
create ForBlocks, 
we have to VInt encode each in-doc freq. For this 'body' field, I think the 
index size we can reduce 
is about 67.5MB (here I only consider vInt block, since 1-bit ForBlock is 
usually small).

For all the fields in wikimediumall, we can save 60.8MB from 245.2MB (for 
df+ttf only).
While the vInt frq block we can omit from PBF is about 95.8MB, I suppose.

I'll test this later.
  
 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: example.png, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary

[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-13 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707709#comment-13707709
 ] 

Han Jiang edited comment on LUCENE-3069 at 7/13/13 10:05 AM:
-

I did a checkindex on wikimediumall.trunk.Lucene41.nd33.3326M:

Here is the bit width summary for body field:


||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656 | 48860170| 43532656|
| 2| 10328824 | 13979539| 16200377|
| 3| 2682453 | 5032450| 6532755|
| 4| 836109 | 2471794| 3134437|
| 5| 262696 | 1324704| 1718862|
| 6| 86487 | 755797| 990563|
| 7| 29276 | 442974| 571996|
| 8| 11257 | 263874| 339382|
| 9| 4627 | 161402| 205662|
|10| 2060 | 102198| 128034|
|11| 979 | 63955| 79531|
|12| 386 | 39377| 48805|
|13| 170 | 24321| 30113|
|14| 65 | 14686| 18437|
|15| 10 | 9055| 10918|
|16| 2 | 5229| 6821|
|17| 0 | 2669| 3595|
|18| 0 | 1312| 1897|
|19| 0 | 696| 914|
|20| 0 | 209| 509|
|21| 0 | 44| 148|
|22| 0 | 4| 38|
|23| 0 | 0| 8|
|24| 0 | 0| 1|
|...|0|0|0|
|tot|57778057|73556459|73556459|

So 66.4% of terms have df==1, and 78.5% have df==ttf.
Considering the different bit sizes, the df+ttf encoding 
saves 57.3MB out of 148.7MB in total, using the following estimation:


{noformat}
old_size = col[2] * vIntByteSize(rownumber)   + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * 
vIntByteSize(rownumber)
{noformat}


By the way, I am quite tempted to omit frq blocks in Lucene41PostingsReader.
When we know that df==ttf, we can be sure that every in-doc freq==1. So for 
example, when the bit width ranges from 2 to 8 (inclusive), df is not large 
enough to create ForBlocks, so we have to VInt-encode each in-doc freq. For 
this 'body' field, I think we can reduce the index size by about 67.5MB (here 
I only consider the vInt blocks, since the 1-bit ForBlock is usually small).

For all the fields in wikimediumall, we can save 60.8MB out of 245.2MB (for 
df+ttf only), while the vInt frq blocks we could omit from the PBF amount to 
about 95.8MB, I suppose.

I'll test this later.

  was (Author: billy):
I did a checkindex on wikimediumall.trunk.Lucene41.nd33.3326M:

Here is the bit width summary for body field:

||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656 | 48860170| 43532656|
| 2| 10328824 | 13979539| 16200377|
| 3| 2682453 | 5032450| 6532755|
| 4| 836109 | 2471794| 3134437|
| 5| 262696 | 1324704| 1718862|
| 6| 86487 | 755797| 990563|
| 7| 29276 | 442974| 571996|
| 8| 11257 | 263874| 339382|
| 9| 4627 | 161402| 205662|
|10| 2060 | 102198| 128034|
|11| 979 | 63955| 79531|
|12| 386 | 39377| 48805|
|13| 170 | 24321| 30113|
|14| 65 | 14686| 18437|
|15| 10 | 9055| 10918|
|16| 2 | 5229| 6821|
|17| 0 | 2669| 3595|
|18| 0 | 1312| 1897|
|19| 0 | 696| 914|
|20| 0 | 209| 509|
|21| 0 | 44| 148|
|22| 0 | 4| 38|
|23| 0 | 0| 8|
|24| 0 | 0| 1|
|25| 0 | 0| 0|
|26| 0 | 0| 0|
|27| 0 | 0| 0|
|28| 0 | 0| 0|
|29| 0 | 0| 0|
|30| 0 | 0| 0|
|31| 0 | 0| 0|
|32| 0 | 0| 0|
|...|0|0|0|
|tot|57778057|73556459|73556459|

So we have 66.4% docFreq with df==1, and 78.5% with df==ttf.
Considering different bit size, for df+ttf encoding, 
totally it saves 57.3MB from 148.7MB, using following estimation:

{noformat}
old_size = col[2] * vIntByteSize(rownumber)   + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * 
vIntByteSize(rownumber)
{noformat}

By the way, I am quite lured to omit frq blocks in Luene41PostingsReader.
When we know that df==ttf, we can always make sure the in-doc frq==1. So for 
example, 
when bit width ranges from 2 to 8(inclusive), since df is not large enough to 
create ForBlocks, 
we have to VInt encode each in-doc freq. For this 'body' field, I think the 
index size we can reduce 
is about 67.5MB (here I only consider vInt block, since 1-bit ForBlock is 
usually small).

For all the fields in wikimediumall, we can save 60.8MB from 245.2MB (for 
df+ttf only).
While the vInt frq block we can omit from PBF is about 95.8MB, I suppose.

I'll test this later.
  
 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: example.png, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST

[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-13 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707709#comment-13707709
 ] 

Han Jiang edited comment on LUCENE-3069 at 7/13/13 11:00 AM:
-

I did a checkindex on wikimediumall.trunk.Lucene41.nd33.3326M:

Here is the bit width summary for body field:


||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656 | 48860170| 43532656|
| 2| 10328824 | 13979539| 16200377|
| 3| 2682453 | 5032450| 6532755|
| 4| 836109 | 2471794| 3134437|
| 5| 262696 | 1324704| 1718862|
| 6| 86487 | 755797| 990563|
| 7| 29276 | 442974| 571996|
| 8| 11257 | 263874| 339382|
| 9| 4627 | 161402| 205662|
|10| 2060 | 102198| 128034|
|11| 979 | 63955| 79531|
|12| 386 | 39377| 48805|
|13| 170 | 24321| 30113|
|14| 65 | 14686| 18437|
|15| 10 | 9055| 10918|
|16| 2 | 5229| 6821|
|17| 0 | 2669| 3595|
|18| 0 | 1312| 1897|
|19| 0 | 696| 914|
|20| 0 | 209| 509|
|21| 0 | 44| 148|
|22| 0 | 4| 38|
|23| 0 | 0| 8|
|24| 0 | 0| 1|
|...|0|0|0|
|tot|57778057|73556459|73556459|

So 66.4% of terms have df==1, and 78.5% have df==ttf.
Using the following estimation, the old size for (df+ttf) here is 148.7MB.

When we steal one bit to mark whether df==ttf, it is reduced to 91.38MB.
When we use df==0 to mark df==ttf==1, wow, it is reduced to 70.31MB. Thanks, 
Robert!

{noformat}
old_size = col[2] * vIntByteSize(rownumber)   + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * 
vIntByteSize(rownumber)
opt_size = col[2] * vIntByteSize(rownumber) + (rownumber == 1) ? 0 : col[3] * 
vIntByteSize(rownumber)
{noformat}


By the way, I am quite tempted to omit frq blocks in Lucene41PostingsReader.
When we know that df==ttf, we can be sure that every in-doc freq==1. So for 
example, when the bit width ranges from 2 to 8 (inclusive), df is not large 
enough to create ForBlocks, so we have to VInt-encode each in-doc freq. For 
this 'body' field, --I think we can reduce the index size by about 67.5MB 
(here I only consider the vInt blocks, since the 1-bit ForBlock is usually 
small)-- (ah, I forgot we already steal a bit for this case in Lucene41PBF).

I'll test this later.
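A minimal sketch (Python; the helper names and the delta encoding of ttf are assumptions, not Lucene's actual wire format) of the df==0 trick: a real docFreq is never 0, so that value is free to mark the common df==ttf==1 case with a single stored value.

```python
# Sketch of encoding (df, ttf) with the "df==0 marks df==ttf==1" trick.
# A real docFreq is always >= 1, so 0 can be repurposed as a flag.

def encode_stats(df, ttf):
    assert 1 <= df <= ttf
    if df == 1 and ttf == 1:
        return [0]             # single value for the long-tail common case
    return [df, ttf - df]      # ttf stored as a non-negative delta (assumed)

def decode_stats(values):
    it = iter(values)
    df = next(it)
    if df == 0:
        return 1, 1            # the flag value: df == ttf == 1
    return df, df + next(it)
```

With 66.4% of terms having df==1 (and hence df==ttf), most terms collapse to one tiny value.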

  was (Author: billy):
I did a checkindex on wikimediumall.trunk.Lucene41.nd33.3326M:

Here is the bit width summary for body field:


||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656 | 48860170| 43532656|
| 2| 10328824 | 13979539| 16200377|
| 3| 2682453 | 5032450| 6532755|
| 4| 836109 | 2471794| 3134437|
| 5| 262696 | 1324704| 1718862|
| 6| 86487 | 755797| 990563|
| 7| 29276 | 442974| 571996|
| 8| 11257 | 263874| 339382|
| 9| 4627 | 161402| 205662|
|10| 2060 | 102198| 128034|
|11| 979 | 63955| 79531|
|12| 386 | 39377| 48805|
|13| 170 | 24321| 30113|
|14| 65 | 14686| 18437|
|15| 10 | 9055| 10918|
|16| 2 | 5229| 6821|
|17| 0 | 2669| 3595|
|18| 0 | 1312| 1897|
|19| 0 | 696| 914|
|20| 0 | 209| 509|
|21| 0 | 44| 148|
|22| 0 | 4| 38|
|23| 0 | 0| 8|
|24| 0 | 0| 1|
|...|0|0|0|
|tot|57778057|73556459|73556459|

So we have 66.4% docFreq with df==1, and 78.5% with df==ttf. 
Considering different bit size, for df+ttf encoding, totally 
it saves 57.3MB from 148.7MB, using following estimation:


{noformat}
old_size = col[2] * vIntByteSize(rownumber)   + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * 
vIntByteSize(rownumber)
{noformat}


By the way, I am quite lured to omit frq blocks in Luene41PostingsReader.
When we know that df==ttf, we can always make sure the in-doc frq==1. So for 
example, 
when bit width ranges from 2 to 8(inclusive), since df is not large enough to 
create ForBlocks, 
we have to VInt encode each in-doc freq. For this 'body' field, I think the 
index size we can reduce 
is about 67.5MB (here I only consider vInt block, since 1-bit ForBlock is 
usually small).

For all the fields in wikimediumall, we can save 60.8MB from 245.2MB (for 
df+ttf only).
While the vInt frq block we can omit from PBF is about 95.8MB, I suppose.

I'll test this later.
  
 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: example.png, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST

[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-13 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707709#comment-13707709
 ] 

Han Jiang edited comment on LUCENE-3069 at 7/13/13 11:02 AM:
-

I did a checkindex on wikimediumall.trunk.Lucene41.nd33.3326M:

Here is the bit width summary for body field:


||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656 | 48860170| 43532656|
| 2| 10328824 | 13979539| 16200377|
| 3| 2682453 | 5032450| 6532755|
| 4| 836109 | 2471794| 3134437|
| 5| 262696 | 1324704| 1718862|
| 6| 86487 | 755797| 990563|
| 7| 29276 | 442974| 571996|
| 8| 11257 | 263874| 339382|
| 9| 4627 | 161402| 205662|
|10| 2060 | 102198| 128034|
|11| 979 | 63955| 79531|
|12| 386 | 39377| 48805|
|13| 170 | 24321| 30113|
|14| 65 | 14686| 18437|
|15| 10 | 9055| 10918|
|16| 2 | 5229| 6821|
|17| 0 | 2669| 3595|
|18| 0 | 1312| 1897|
|19| 0 | 696| 914|
|20| 0 | 209| 509|
|21| 0 | 44| 148|
|22| 0 | 4| 38|
|23| 0 | 0| 8|
|24| 0 | 0| 1|
|...|0|0|0|
|tot|57778057|73556459|73556459|

So 66.4% of terms have df==1, and 78.5% have df==ttf.
Using the following estimation, the old size for (df+ttf) here is 148.7MB.

When we steal one bit to mark whether df==ttf, it is reduced to 91.38MB.
When we use df==0 to mark df==ttf==1, wow, it is reduced to 70.31MB. Thanks, 
Robert!

{noformat}
old_size = col[2] * vIntByteSize(rownumber)   + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * 
vIntByteSize(rownumber)
opt_size = col[2] * vIntByteSize(rownumber) + (rownumber == 1) ? 0 : col[3] * 
vIntByteSize(rownumber)
{noformat}


By the way, I am quite tempted to omit frq blocks in Lucene41PostingsReader.
When we know that df==ttf, we can be sure that every in-doc freq==1. So for 
example, when the bit width ranges from 2 to 8 (inclusive), df is not large 
enough to create ForBlocks, so we have to VInt-encode each in-doc freq. For 
this 'body' field, -I think we can reduce the index size by about 67.5MB- 
-(here I only consider the vInt blocks, since the 1-bit ForBlock is usually 
small)- (ah, I forgot we already steal a bit for this case in Lucene41PBF).

I'll test this later.

  was (Author: billy):
I did a checkindex on wikimediumall.trunk.Lucene41.nd33.3326M:

Here is the bit width summary for body field:


||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656 | 48860170| 43532656|
| 2| 10328824 | 13979539| 16200377|
| 3| 2682453 | 5032450| 6532755|
| 4| 836109 | 2471794| 3134437|
| 5| 262696 | 1324704| 1718862|
| 6| 86487 | 755797| 990563|
| 7| 29276 | 442974| 571996|
| 8| 11257 | 263874| 339382|
| 9| 4627 | 161402| 205662|
|10| 2060 | 102198| 128034|
|11| 979 | 63955| 79531|
|12| 386 | 39377| 48805|
|13| 170 | 24321| 30113|
|14| 65 | 14686| 18437|
|15| 10 | 9055| 10918|
|16| 2 | 5229| 6821|
|17| 0 | 2669| 3595|
|18| 0 | 1312| 1897|
|19| 0 | 696| 914|
|20| 0 | 209| 509|
|21| 0 | 44| 148|
|22| 0 | 4| 38|
|23| 0 | 0| 8|
|24| 0 | 0| 1|
|...|0|0|0|
|tot|57778057|73556459|73556459|

So we have 66.4% docFreq with df==1, and 78.5% with df==ttf.
Using following estimation, the old size for (df+ttf) here is 148.7MB.

When we steal one bit to mark whether df==ttf, it is reduced to 91.38MB.
When we use df==0 to mark df==ttf==1, wow, it is reduced to 70.31MB, thanks 
Robert!

{noformat}
old_size = col[2] * vIntByteSize(rownumber)   + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * 
vIntByteSize(rownumber)
opt_size = col[2] * vIntByteSize(rownumber) + (rownumber == 1) ? 0 : col[3] * 
vIntByteSize(rownumber)
{noformat}


By the way, I am quite lured to omit frq blocks in Luene41PostingsReader.
When we know that df==ttf, we can always make sure the in-doc frq==1. So for 
example, 
when bit width ranges from 2 to 8(inclusive), since df is not large enough to 
create ForBlocks, 
we have to VInt encode each in-doc freq. For this 'body' field, --I think the 
index size we can reduce 
is about 67.5MB (here I only consider vInt block, since 1-bit ForBlock is 
usually small)-- (ah I forgot
we already steals bit for this case in Lucene41PBF.

I'll test this later.
  
 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: example.png, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory

[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-13 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: df-ttf-estimate.txt

Uploaded detail data for wikimediumall.

Oh, sorry, there was an error when I 
calculated the index size for the df==0 trick; 
it should be 105MB instead of 70MB.

But the real test still deviates from the 
estimation (weird...). The df==0 trick 
gains similar compression.

Index sizes are below:
{noformat}
v0:   13195304
v1 = v0 + flag byte:  12847172
v2 = v1 + steal bit:  12770700
v3 = v1 + zero df:12780884
{noformat}

Another thing that surprised me: with the same code/conf, 
luceneutil creates different index sizes? I tested 
the df==0 trick several times on wikimedium1m; the 
index size varies from 514M to 522M... Does multi-threading 
affect this much?


 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.




[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-13 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707780#comment-13707780
 ] 

Han Jiang edited comment on LUCENE-3069 at 7/13/13 4:48 PM:


Uploaded detail data for wikimediumall.

Oh, sorry, there was an error when I 
calculated the index size for the df==0 trick; 
it should be 105MB instead of 70MB.

But the real test still deviates from the 
estimation (weird...). The df==0 trick 
gains similar compression.

Index sizes are below (KB):
{noformat}
v0:   13195304
v1 = v0 + flag byte:  12847172
v2 = v1 + steal bit:  12770700
v3 = v1 + zero df:12780884
{noformat}

Another thing that surprised me: with the same code/conf, 
luceneutil creates different index sizes? I tested 
the df==0 trick several times on wikimedium1m; the 
index size varies from 514M to 522M... Does multi-threading 
affect this much?


  was (Author: billy):
Uploaded detail data for wikimediumall.

Oh, sorry, there is an error when I 
caculated index size for df==0 trick, 
it should be 105MB instead of 70MB.

But the real test is still beyond 
estimation (weird...). df==0 tricks
gains similar compression.

Index size are below:
{noformat}
v0:   13195304
v1 = v0 + flag byte:  12847172
v2 = v1 + steal bit:  12770700
v3 = v1 + zero df:12780884
{noformat}

Another thing that surprised me is, with the same code/conf, 
luceneutil creates different sizes of index? I tested 
that df==0 trick several times on wikimedium1m, the 
index size varies from 514M~522M... Will multi-threading affects
much here?

  
 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.




[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-12 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: example.png
LUCENE-3069.patch

Uploaded the patch; it is the main part of the changes I committed to 
branch3069.

The picture shows the current impl of the outputs (it is fetched from one 
field in wikimedium5k):

* long[] (sortable metadata)
* byte[] (unsortable, generic metadata)
* df, ttf (term stats)

A single flag byte indicates which of these fields the current outputs 
maintain; for a PBF with a short byte[], this should be enough. Also, for 
long-tail terms, the totalTermFreq can safely be inlined into the docFreq 
(for the body field in wikimedium1m, 85.8% of terms have df == ttf).


Since TermsEnum is entirely based on FSTEnum, term dict performance should be 
similar to MemoryPF. However, for PK tasks, we have to pull the docsEnum from 
MMap, so this hurts.
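The flag-byte idea above can be sketched as follows (Python; the field names and bit assignments are hypothetical, not Lucene's actual encoding): one byte records which parts of the output are present, and ttf is omitted entirely when it equals df.

```python
# Sketch of a flag byte describing which parts of an FST term output are
# present: long[] metadata, byte[] metadata, and term stats, with ttf
# omitted (inlined into df) when df == ttf.

HAS_LONGS, HAS_BYTES, DF_EQ_TTF = 0x01, 0x02, 0x04

def encode_output(longs, data, df, ttf):
    flag = 0
    if longs: flag |= HAS_LONGS
    if data:  flag |= HAS_BYTES
    if df == ttf: flag |= DF_EQ_TTF       # ttf not stored at all
    out = {'flag': flag, 'df': df}
    if longs: out['longs'] = list(longs)
    if data:  out['bytes'] = bytes(data)
    if df != ttf: out['ttf'] = ttf
    return out

def decode_output(out):
    flag = out['flag']
    longs = out.get('longs', []) if flag & HAS_LONGS else []
    data = out.get('bytes', b'') if flag & HAS_BYTES else b''
    df = out['df']
    ttf = df if flag & DF_EQ_TTF else out['ttf']
    return longs, data, df, ttf
```

For the 85.8% of wikimedium1m body terms with df == ttf, the output then carries only the flag byte, df, and whatever metadata is present.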


Following is the performance comparison:

{noformat}
pure TempFST vs. Lucene41 + Memory(on idField), on wikimediumall

Task               QPS base  StdDev  QPS comp  StdDev        Pct diff
          Respell     48.13  (4.4%)     15.38  (1.0%)  -68.0% ( -70% -  -65%)
           Fuzzy2     51.30  (5.3%)     17.47  (1.3%)  -65.9% ( -68% -  -62%)
           Fuzzy1     52.24  (4.0%)     18.50  (1.2%)  -64.6% ( -67% -  -61%)
         Wildcard      9.31  (1.7%)      6.16  (2.2%)  -33.8% ( -37% -  -30%)
          Prefix3     23.25  (1.8%)     19.00  (2.2%)  -18.3% ( -21% -  -14%)
         PKLookup    244.92  (3.6%)    225.42  (2.3%)   -8.0% ( -13% -   -2%)
          LowTerm    295.88  (5.5%)    293.27  (4.8%)   -0.9% ( -10% -    9%)
       HighPhrase     13.62  (6.5%)     13.54  (7.4%)   -0.6% ( -13% -   14%)
          MedTerm     99.51  (7.8%)     99.19  (7.7%)   -0.3% ( -14% -   16%)
        MedPhrase    154.63  (9.4%)    154.38 (10.1%)   -0.2% ( -17% -   21%)
         HighTerm     28.25 (10.7%)     28.25 (10.0%)   -0.0% ( -18% -   23%)
       OrHighHigh     16.83 (13.3%)     16.86 (13.1%)    0.2% ( -23% -   30%)
 HighSloppyPhrase      9.02  (4.4%)      9.03  (4.5%)    0.2% (  -8% -    9%)
        LowPhrase      6.26  (3.4%)      6.27  (4.1%)    0.2% (  -7% -    8%)
        OrHighMed     13.73 (13.2%)     13.77 (12.8%)    0.3% ( -22% -   30%)
        OrHighLow     25.65 (13.2%)     25.73 (13.0%)    0.3% ( -22% -   30%)
  MedSloppyPhrase      6.63  (2.7%)      6.66  (2.7%)    0.5% (  -4% -    6%)
       AndHighMed     42.77  (1.8%)     43.13  (1.5%)    0.8% (  -2% -    4%)
  LowSloppyPhrase     32.68  (3.0%)     32.96  (2.8%)    0.8% (  -4% -    6%)
      AndHighHigh     22.90  (1.2%)     23.18  (0.7%)    1.2% (   0% -    3%)
      LowSpanNear     29.30  (2.0%)     29.83  (2.2%)    1.8% (  -2% -    6%)
      MedSpanNear      8.39  (2.7%)      8.56  (2.9%)    2.0% (  -3% -    7%)
           IntNRQ      3.12  (1.9%)      3.18  (6.7%)    2.1% (  -6% -   10%)
       AndHighLow    507.01  (2.4%)    522.10  (2.8%)    3.0% (  -2% -    8%)
     HighSpanNear      5.43  (1.8%)      5.60  (2.6%)    3.1% (  -1% -    7%)
{noformat}


{noformat}
pure TempFST vs. pure Lucene41, on wikimediumall

TaskQPS base  StdDevQPS comp  StdDev
Pct diff
 Respell   49.24  (2.7%)   15.51  (1.0%)  
-68.5% ( -70% -  -66%)
  Fuzzy2   52.01  (4.8%)   17.61  (1.4%)  
-66.1% ( -68% -  -63%)
  Fuzzy1   53.00  (4.0%)   18.62  (1.3%)  
-64.9% ( -67% -  -62%)
Wildcard9.37  (1.3%)6.15  (2.1%)  
-34.4% ( -37% -  -31%)
 Prefix3   23.36  (0.8%)   18.96  (2.1%)  
-18.8% ( -21% -  -16%)
   MedPhrase  155.86  (9.8%)  152.34  (9.7%)   
-2.3% ( -19% -   19%)
   LowPhrase6.33  (3.7%)6.23  (4.0%)   
-1.6% (  -8% -6%)
  HighPhrase   13.68  (7.2%)   13.49  (6.8%)   
-1.4% ( -14% -   13%)
   OrHighMed   13.78 (13.0%)   13.68 (12.7%)   
-0.8% ( -23% -   28%)
HighSloppyPhrase9.14  (5.2%)9.07  (3.7%)   
-0.7% (  -9% -8%)
  OrHighHigh   16.87 (13.3%)   16.76 (12.9%)   
-0.6% ( -23% -   29%)
   OrHighLow   25.71 (13.1%)   25.58 (12.8%)   
-0.5% ( -23% -   29

[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-12 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13706703#comment-13706703
 ] 

Han Jiang edited comment on LUCENE-3069 at 7/13/13 1:42 AM:


Uploaded patch; it is the main part of the changes I committed to branch3069.

The picture shows the current impl of outputs (fetched from one field in 
wikimedium5k).

* long[] (sortable metadata)
* byte[] (unsortable, generic metadata)
* df, ttf (term stats)

A single byte flag is used to indicate which of these fields the current outputs 
maintain; for a PBF with a short byte[], this should be enough. Also, for 
long-tail terms, the totalTermFreq can safely be inlined into docFreq (for the 
body field in wikimedium1m, 85.8% of terms have df == ttf).


Since TermsEnum is entirely based on FSTEnum, term dict performance should be 
similar to MemoryPF. However, for PK tasks we have to pull the docsEnum from 
MMap, so this hurts.


Following is the performance comparison:

{noformat}
pure TempFST vs. Lucene41 + Memory(on idField), on wikimediumall

                Task    QPS base  StdDev    QPS comp  StdDev              Pct diff
             Respell       48.13  (4.4%)       15.38  (1.0%)  -68.0% ( -70% -  -65%)
              Fuzzy2       51.30  (5.3%)       17.47  (1.3%)  -65.9% ( -68% -  -62%)
              Fuzzy1       52.24  (4.0%)       18.50  (1.2%)  -64.6% ( -67% -  -61%)
            Wildcard        9.31  (1.7%)        6.16  (2.2%)  -33.8% ( -37% -  -30%)
             Prefix3       23.25  (1.8%)       19.00  (2.2%)  -18.3% ( -21% -  -14%)
            PKLookup      244.92  (3.6%)      225.42  (2.3%)   -8.0% ( -13% -   -2%)
             LowTerm      295.88  (5.5%)      293.27  (4.8%)   -0.9% ( -10% -    9%)
          HighPhrase       13.62  (6.5%)       13.54  (7.4%)   -0.6% ( -13% -   14%)
             MedTerm       99.51  (7.8%)       99.19  (7.7%)   -0.3% ( -14% -   16%)
           MedPhrase      154.63  (9.4%)      154.38 (10.1%)   -0.2% ( -17% -   21%)
            HighTerm       28.25 (10.7%)       28.25 (10.0%)   -0.0% ( -18% -   23%)
          OrHighHigh       16.83 (13.3%)       16.86 (13.1%)    0.2% ( -23% -   30%)
    HighSloppyPhrase        9.02  (4.4%)        9.03  (4.5%)    0.2% (  -8% -    9%)
           LowPhrase        6.26  (3.4%)        6.27  (4.1%)    0.2% (  -7% -    8%)
           OrHighMed       13.73 (13.2%)       13.77 (12.8%)    0.3% ( -22% -   30%)
           OrHighLow       25.65 (13.2%)       25.73 (13.0%)    0.3% ( -22% -   30%)
     MedSloppyPhrase        6.63  (2.7%)        6.66  (2.7%)    0.5% (  -4% -    6%)
          AndHighMed       42.77  (1.8%)       43.13  (1.5%)    0.8% (  -2% -    4%)
     LowSloppyPhrase       32.68  (3.0%)       32.96  (2.8%)    0.8% (  -4% -    6%)
         AndHighHigh       22.90  (1.2%)       23.18  (0.7%)    1.2% (   0% -    3%)
         LowSpanNear       29.30  (2.0%)       29.83  (2.2%)    1.8% (  -2% -    6%)
         MedSpanNear        8.39  (2.7%)        8.56  (2.9%)    2.0% (  -3% -    7%)
              IntNRQ        3.12  (1.9%)        3.18  (6.7%)    2.1% (  -6% -   10%)
          AndHighLow      507.01  (2.4%)      522.10  (2.8%)    3.0% (  -2% -    8%)
        HighSpanNear        5.43  (1.8%)        5.60  (2.6%)    3.1% (  -1% -    7%)
{noformat}


{noformat}
pure TempFST vs. pure Lucene41, on wikimediumall

                Task    QPS base  StdDev    QPS comp  StdDev              Pct diff
             Respell       49.24  (2.7%)       15.51  (1.0%)  -68.5% ( -70% -  -66%)
              Fuzzy2       52.01  (4.8%)       17.61  (1.4%)  -66.1% ( -68% -  -63%)
              Fuzzy1       53.00  (4.0%)       18.62  (1.3%)  -64.9% ( -67% -  -62%)
            Wildcard        9.37  (1.3%)        6.15  (2.1%)  -34.4% ( -37% -  -31%)
             Prefix3       23.36  (0.8%)       18.96  (2.1%)  -18.8% ( -21% -  -16%)
           MedPhrase      155.86  (9.8%)      152.34  (9.7%)   -2.3% ( -19% -   19%)
           LowPhrase        6.33  (3.7%)        6.23  (4.0%)   -1.6% (  -8% -    6%)
          HighPhrase       13.68  (7.2%)       13.49  (6.8%)   -1.4% ( -14% -   13%)
           OrHighMed       13.78 (13.0%)       13.68 (12.7%)   -0.8% ( -23% -   28%)
    HighSloppyPhrase        9.14  (5.2%)        9.07  (3.7%)   -0.7% (  -9% -    8%)
          OrHighHigh       16.87 (13.3%)       16.76 (12.9%)   -0.6% ( -23% -   29%)
           OrHighLow       25.71 (13.1%)       25.58 (12.8%)   -0.5% ( -23% -   29%)
{noformat}

[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-12 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707649#comment-13707649
 ] 

Han Jiang commented on LUCENE-3069:
---

bq. Cool idea! I wonder how many of those are df == ttf == 1?

I didn't try a very precise estimation, but the percentage is large:

For the index of wikimedium1m, the largest segment's 'body' field has:

{noformat}
bitwidth/7  df==ttf   df
1   1324400 / 1542987
2   110 / 18951
3   0   / 175
4   0   / 0
5   0   / 0
{noformat}

That is where the 85.8% comes from. 'bitwidth/7' means 'ceil(bitwidth of df / 7)', 
since we're using VInt encoding. 
So, for this field, we can save (1324400 + 110*2) bytes by stealing one bit.

bq. Maybe we could try writing a vInt of 0 for docFreq to indicate that both 
docFreq and totalTermFreq are 1?

Yes, that may help! I'll try to measure the percentage. But we should still note 
that df is a small part of the term dict data.
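The bit-stealing idea discussed above can be sketched roughly as follows: the low bit of the vInt-encoded docFreq flags the common df == ttf case, so no separate totalTermFreq delta is written for it. The class and method names, and the list-of-values framing, are purely illustrative and not Lucene's actual codec API:

```java
import java.util.ArrayList;
import java.util.List;

public class StatsCodec {
  // Encode df and ttf as one or two values (each would become a vInt on disk).
  public static List<Long> encode(int df, long ttf) {
    List<Long> out = new ArrayList<>();
    if (ttf == df) {
      out.add(((long) df << 1) | 1);  // low bit set: ttf == df, nothing more to write
    } else {
      out.add((long) df << 1);        // low bit clear: a ttf delta follows
      out.add(ttf - df);              // ttf >= df always holds, so the delta is non-negative
    }
    return out;
  }

  public static long[] decode(List<Long> in) {
    long first = in.get(0);
    long df = first >>> 1;
    long ttf = ((first & 1) != 0) ? df : df + in.get(1);
    return new long[] { df, ttf };
  }
}
```

For terms where df == ttf (85.8% of the 'body' field above), the second value is simply never written, which is where the saved bytes come from.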

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Han Jiang
  Labels: gsoc2013
 Fix For: 4.4

 Attachments: example.png, LUCENE-3069.patch


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5029) factor out a generic 'TermState' for better sharing in FST-based term dict

2013-06-16 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-5029:
--

Attachment: LUCENE-5029.patch

This patch keeps the original 'customize termstate in PBF' design. 
It also pushes flushTermsBlock & readTermsBlock to the term dict side.

Now the rule is: if your PBF has some monotonic but 'don't care' values,
always fill them with -1, so that the term dict will reuse the previous
values to 'pad' those -1s. Yes Mike, the algebra is really simple :)
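That padding rule might look like this minimal sketch (purely illustrative; the real term dict operates on BlockTermState metadata, not a bare long[]):

```java
public class MetadataPad {
  // Replace every -1 "don't care" slot in the current term's metadata
  // with the corresponding value from the previous term, so monotonic
  // file pointers delta-encode naturally.
  public static long[] pad(long[] prev, long[] current) {
    long[] out = current.clone();
    for (int i = 0; i < out.length; i++) {
      if (out[i] == -1) {
        out[i] = prev[i];
      }
    }
    return out;
  }
}
```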

But I still have a problem removing that termBlockOrd from BlockTermState:
every time a caller uses seekExact(), it is expected to get a new term
state in which 'termBlockOrd' is involved. However, I cannot fully 
understand how this variable works; maybe we can use metadataUpto
to replace it? I'll try this later.

Can you put the TestDrillSideway fix in the lucene3069 branch as well? 
Thanks :)


 factor out a generic 'TermState' for better sharing in FST-based term dict
 --

 Key: LUCENE-5029
 URL: https://issues.apache.org/jira/browse/LUCENE-5029
 Project: Lucene - Core
  Issue Type: Sub-task
Reporter: Han Jiang
Assignee: Han Jiang
Priority: Minor
 Fix For: 4.4

 Attachments: LUCENE-5029.algebra.patch, LUCENE-5029.algebra.patch, 
 LUCENE-5029.branch-init.patch, LUCENE-5029.patch, LUCENE-5029.patch, 
 LUCENE-5029.patch, LUCENE-5029.patch, LUCENE-5029.patch


 Currently, those two FST-based term dicts (memory codec & blocktree) both use 
 FST<BytesRef> as a base data structure; this might not share much data in 
 parent arcs, since the encoded BytesRef doesn't guarantee that 
 'Outputs.common()' always creates a long prefix. 
 For the current postings format, though, it is guaranteed that each FP (pointing 
 to .doc, .pos, etc.) increases monotonically with 'larger' terms. That 
 means, between two Outputs, the Output from the smaller term can be safely 
 pushed towards the root. However, we always have some tricky TermState to deal 
 with (like the singletonDocID for the pulsing trick), so as Mike suggested, we 
 can simply cut the whole TermState into two parts: one part for comparison 
 and intersection, another for restoring generic data. Then the data structure 
 will be clear: this generic 'TermState' will consist of a fixed-length 
 LongsRef and a variable-length BytesRef. 
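As a rough illustration of the algebra this split enables — assuming every slot of the long[] part grows monotonically with term order — an Outputs over fixed-length long[] can share via element-wise minimum, while the opaque byte[] part never participates in sharing. The class and method names below are hypothetical, not Lucene's real fst.Outputs API:

```java
public class LongsOutputs {
  // The shared prefix of two outputs: element-wise minimum is safe
  // because each slot is monotonic in term order.
  public static long[] common(long[] a, long[] b) {
    long[] out = new long[a.length];
    for (int i = 0; i < a.length; i++) {
      out[i] = Math.min(a[i], b[i]);
    }
    return out;
  }

  // Remove a shared prefix that was pushed toward the root arc.
  public static long[] subtract(long[] output, long[] inc) {
    long[] out = new long[output.length];
    for (int i = 0; i < output.length; i++) {
      out[i] = output[i] - inc[i];
    }
    return out;
  }

  // Re-accumulate outputs while walking arcs from the root.
  public static long[] add(long[] prefix, long[] output) {
    long[] out = new long[prefix.length];
    for (int i = 0; i < prefix.length; i++) {
      out[i] = prefix[i] + output[i];
    }
    return out;
  }
}
```

The invariant is the usual one for FST outputs: add(common(a, b), subtract(a, common(a, b))) reconstructs a exactly.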




[jira] [Resolved] (LUCENE-5029) factor out a generic 'TermState' for better sharing in FST-based term dict

2013-06-16 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang resolved LUCENE-5029.
---

Resolution: Fixed

PostingsBase is now pluggable for non-block-based term dicts, 
and the introduction of long[] and byte[] naturally helps 
the delta-encoding in both the block-based term dict and the 
FST-based term dict.




