Re: Welcome Karl Wright as a Lucene/Solr committer!
Welcome Karl!

On Mon, Apr 4, 2016 at 3:40 PM, Karl Wright <daddy...@gmail.com> wrote:
> Hi all,
>
> Professionally, I've been active in software development since the 1970s.
> My interests include many things related to software development, as well
> as areas as varied as geology, carpentry, and gardening. I'm the PMC chair
> for the ManifoldCF project, as well as a committer on other Apache
> projects such as HttpComponents.
>
> My current employer is HERE, Inc., a spin-off from Nokia that sells map
> data, services, and search capabilities.
>
> I'm also the contributor and principal author of the Geo3D package, which
> is now part of Lucene under the spatial3d module. I intend to continue to
> contribute to this package for the foreseeable future.
>
> Thanks!!
> Karl
>
> On Mon, Apr 4, 2016 at 10:28 AM, Michael McCandless
> <luc...@mikemccandless.com> wrote:
>> I'm pleased to announce that Karl Wright has accepted the Lucene PMC's
>> invitation to become a committer.
>>
>> Karl, it's tradition that you introduce yourself with a brief bio.
>>
>> Karma has been granted to your pre-existing account, so that you can
>> add yourself to the committers section of the Who We Are page on the
>> website: http://lucene.apache.org/whoweare.html
>>
>> Congratulations and welcome!
>>
>> Mike McCandless
>> http://blog.mikemccandless.com

--
Han Jiang
Re: Welcome Dennis Gove as Lucene/Solr committer
Welcome Dennis!

On Fri, Nov 6, 2015 at 3:19 PM, Joel Bernstein <joels...@gmail.com> wrote:
> I'm pleased to announce that Dennis Gove has accepted the PMC's
> invitation to become a committer.
>
> Dennis, it's tradition that you introduce yourself with a brief bio.
>
> Your account is not entirely ready yet. We will let you know when it is
> created and karma has been granted, so that you can add yourself to the
> committers section of the Who We Are page on the website:
> <http://lucene.apache.org/whoweare.html>.
>
> Congratulations and welcome!
>
> Joel Bernstein
> http://joelsolr.blogspot.com/

--
Han Jiang
Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China
Re: Welcome Nick Knize as Lucene/Solr committer
Welcome Nick!

On Wed, Oct 21, 2015 at 12:50 AM, Michael McCandless
<luc...@mikemccandless.com> wrote:
> I'm pleased to announce that Nick Knize has accepted the PMC's
> invitation to become a committer.
>
> Nick, it's tradition that you introduce yourself with a brief bio /
> origin story, explaining how you arrived here.
>
> Your handle "nknize" has already been added to the "lucene" LDAP group,
> so you now have commit privileges.
>
> Please celebrate this rite of passage, and confirm that the right karma
> has in fact been enabled, by embarking on the challenge of adding
> yourself to the committers section of the Who We Are page on the
> website: http://lucene.apache.org/whoweare.html (use the ASF CMS
> bookmarklet at the bottom of this page: https://cms.apache.org/#bookmark
> - more info here: http://www.apache.org/dev/cms.html).
>
> Congratulations and welcome!
>
> Mike McCandless
> http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

--
Han Jiang
Re: Welcome Christine Poerschke as Lucene/Solr committer
Welcome, Christine!

On Fri, Jul 24, 2015 at 3:27 PM, Adrien Grand <jpou...@gmail.com> wrote:
> I'm pleased to announce that Christine Poerschke has accepted the PMC's
> invitation to become a committer.
>
> Christine, it's tradition that you introduce yourself with a brief bio.
>
> Your account is not entirely ready yet. We will let you know when it is
> created and karma has been granted, so that you can add yourself to the
> committers section of the Who We Are page on the website:
> http://lucene.apache.org/whoweare.html.
>
> Congratulations and welcome!
>
> --
> Adrien

--
Han Jiang
Re: Welcome Mikhail Khludnev as Lucene/Solr committer
Welcome, Mikhail!

On Tue, Jul 21, 2015 at 3:21 PM, Adrien Grand <jpou...@gmail.com> wrote:
> I'm pleased to announce that Mikhail Khludnev has accepted the PMC's
> invitation to become a committer.
>
> Mikhail, it's tradition that you introduce yourself with a brief bio.
>
> Your handle "mkhl" has already been added to the "lucene" LDAP group, so
> you now have commit privileges.
>
> Please test this by adding yourself to the committers section of the Who
> We Are page on the website: http://lucene.apache.org/whoweare.html (use
> the ASF CMS bookmarklet at the bottom of this page:
> https://cms.apache.org/#bookmark - more info here:
> http://www.apache.org/dev/cms.html).
>
> Congratulations and welcome!
>
> --
> Adrien

--
Han Jiang
Re: Welcome Upayavira as Lucene/Solr committer
Welcome, Upayavira!

On Tue, Jun 23, 2015 at 3:02 AM, Steve Rowe <sar...@gmail.com> wrote:
> I'm pleased to announce that Upayavira has accepted the PMC's invitation
> to become a committer.
>
> Upayavira, it's tradition that you introduce yourself with a brief bio.
>
> Mike McCandless, the Lucene PMC chair, has already added your
> "upayavira" account to the "lucene" LDAP group, so you now have commit
> privileges.
>
> Please test this by adding yourself to the committers section of the Who
> We Are page on the website: http://lucene.apache.org/whoweare.html (use
> the ASF CMS bookmarklet at the bottom of this page:
> https://cms.apache.org/#bookmark - more info here:
> http://www.apache.org/dev/cms.html).
>
> Congratulations and welcome!
>
> Steve

--
Han Jiang
Re: Welcome Gregory Chanan as Lucene/Solr committer
Welcome Gregory!

On Sat, Sep 20, 2014 at 9:26 AM, Ryan Ernst <r...@iernst.net> wrote:
> Welcome Gregory!
>
> On Sep 19, 2014 3:33 PM, Steve Rowe <sar...@gmail.com> wrote:
>> I'm pleased to announce that Gregory Chanan has accepted the PMC's
>> invitation to become a committer.
>>
>> Gregory, it's tradition that you introduce yourself with a brief bio.
>>
>> Mark Miller, the Lucene PMC chair, has already added your "gchanan"
>> account to the "lucene" LDAP group, so you now have commit privileges.
>>
>> Please test this by adding yourself to the committers section of the
>> Who We Are page on the website: http://lucene.apache.org/whoweare.html
>> (use the ASF CMS bookmarklet at the bottom of this page:
>> https://cms.apache.org/#bookmark - more info here:
>> http://www.apache.org/dev/cms.html).
>>
>> Since you're a committer on the Apache HBase project, you probably
>> already know about it, but I'll include a link to the ASF dev page
>> anyway - lots of useful links: http://www.apache.org/dev/.
>>
>> Congratulations and welcome!
>>
>> Steve

--
Han Jiang
[jira] [Commented] (LUCENE-5841) Remove FST.Builder.FreezeTail interface
[ https://issues.apache.org/jira/browse/LUCENE-5841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071563#comment-14071563 ]

Han Jiang commented on LUCENE-5841:
-----------------------------------

It is really great to see this interface removed!

> Remove FST.Builder.FreezeTail interface
> ---------------------------------------
>
>                 Key: LUCENE-5841
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5841
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.0, 4.10
>
>         Attachments: LUCENE-5841.patch
>
> The FST Builder has a crazy-hairy interface called FreezeTail, which is
> only used by BlockTreeTermsWriter to find appropriate prefixes (i.e.
> containing enough terms or sub-blocks) to write term blocks. But this is
> really a silly abuse ... it's cleaner and likely faster / less GC for
> BTTW to compute this itself, just by tracking the term ordinal where
> each prefix started in the pending terms/blocks.
>
> The code is also insanely hairy, and this is at least a baby step to try
> to make it a bit simpler. This also makes it very hard to experiment
> with different formats at write time, because you have to get your new
> formats working through this strange FreezeTail.

--
This message was sent by Atlassian JIRA (v6.2#6252)
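The alternative the issue describes — having the terms writer itself track the term ordinal where each prefix run started, and emit a block once a prefix has accumulated enough entries — can be sketched as follows. This is an illustrative Python sketch, not Lucene's actual code; the single fixed `prefix_len` and the `min_block_size` threshold are simplifications of what BlockTreeTermsWriter really does.

```python
def blocks_by_prefix(sorted_terms, prefix_len, min_block_size):
    """Group sorted terms into blocks of shared prefix by tracking the
    ordinal where each prefix run started.  Returns (prefix, start, end)
    tuples; runs smaller than min_block_size are left unblocked."""
    blocks = []
    start = 0  # ordinal where the current prefix run began
    for i in range(1, len(sorted_terms) + 1):
        at_end = i == len(sorted_terms)
        if at_end or sorted_terms[i][:prefix_len] != sorted_terms[start][:prefix_len]:
            # The prefix run [start, i) just ended.
            if i - start >= min_block_size:
                blocks.append((sorted_terms[start][:prefix_len], start, i))
            start = i
    return blocks
```

Because the input is sorted, a single pass with one saved ordinal suffices — no callback interface like FreezeTail is needed to find the block boundaries.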
[jira] [Closed] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang closed LUCENE-3069.
-----------------------------

> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
>                 Key: LUCENE-3069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3069
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Simon Willnauer
>            Assignee: Han Jiang
>              Labels: gsoc2014
>             Fix For: 4.7
>
>         Attachments: LUCENE-3069.patch, LUCENE-3069.patch,
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch,
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch,
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch,
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch,
> df-ttf-estimate.txt, example.png
>
> The FST-based TermDictionary has been a great improvement, yet it still
> uses a delta-codec file for scanning to terms. Some environments have
> enough memory available to keep the entire FST-based term dict in
> memory. We should add a TermDictionary implementation that encodes all
> needed information for each term into the FST (custom fst.Output) and
> builds the FST from the entire term, not just the delta.

--
This message was sent by Atlassian JIRA (v6.2#6252)
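The improvement the issue describes — encoding each term's metadata directly as the output reached by walking the *entire* term through an in-memory structure, so lookups never fall back to scanning a delta-coded file — can be illustrated with a toy prefix-sharing dictionary. This is a hypothetical Python sketch only: a real Lucene FST also shares suffixes and distributes outputs along arcs rather than storing them at leaf nodes.

```python
class TrieTermDict:
    """Toy memory-resident term dictionary: a character trie mapping each
    full term to its metadata (e.g. a (docFreq, postingsPointer) pair)."""

    def __init__(self):
        self.root = {}

    def add(self, term, metadata):
        node = self.root
        for ch in term:
            node = node.setdefault(ch, {})
        node["$"] = metadata  # "$" marks end-of-term; holds the term's output

    def lookup(self, term):
        """Walk the full term; return its metadata, or None if absent."""
        node = self.root
        for ch in term:
            if ch not in node:
                return None
            node = node[ch]
        return node.get("$")
```

The key point mirrored from the issue: lookup cost is a pure in-memory walk of the term's characters, with no disk scan, at the price of holding the whole structure in RAM.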
[jira] [Resolved] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang resolved LUCENE-3069.
-------------------------------
    Resolution: Fixed

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang closed LUCENE-3069.
-----------------------------

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang reopened LUCENE-3069:
-------------------------------

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3069:
------------------------------
    Labels: gsoc2013  (was: gsoc2014)

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937095#comment-13937095 ]

Han Jiang commented on LUCENE-3069:
-----------------------------------

Had to reopen it because jira doesn't permit label change :)

--
This message was sent by Atlassian JIRA (v6.2#6252)
Re: Welcome Anshum Gupta as Lucene/Solr Committer!
Welcome Anshum!

On Mon, Feb 17, 2014 at 6:33 AM, Mark Miller <markrmil...@gmail.com> wrote:
> Hey everybody!
>
> The Lucene PMC is happy to welcome Anshum Gupta as a committer on the
> Lucene / Solr project. Anshum has contributed to a number of issues for
> the project, especially around SolrCloud.
>
> Welcome Anshum! It's tradition to introduce yourself with a short bio :)
>
> --
> - Mark
> http://about.me/markrmiller

--
Han Jiang
[jira] [Resolved] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang resolved LUCENE-3069.
-------------------------------
    Resolution: Fixed

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886256#comment-13886256 ]

Han Jiang commented on LUCENE-3069:
-----------------------------------

Thanks Mike!

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
Re: Welcome Benson Margulies as Lucene/Solr committer!
Welcome Benson!

On Sun, Jan 26, 2014 at 6:10 AM, Benson Margulies <bimargul...@gmail.com> wrote:
> Hello Lucene development community, it's a pleasure to be welcomed
> aboard.
>
> In my view, the significant aspect of my bio is that I've been
> implementing things that go into or around Lucene for many years now.
> During the 'day', I'm the CTO of a company that works in the area of
> text analytics. We build Tokenizers and TokenFilters to allow our users
> to integrate our components into Lucene, and we've used Lucene and Solr
> as components of NLP devices that search on a large scale. So I have an
> abiding interest in the analysis chain and in the intersection of NLP
> and search.
>
> Elsewhere in Apache, I'm an active Maven dev, a semi-retired CXF dev,
> and a sort of uncle of several other projects. So I'm prone to be
> helpful or annoying with issues of Maven and Web Services.
>
> Thanks again,
> benson
>
> p.s. I think Uwe has already added me to the necessary wiring; would
> some kind soul please point me to the explanation of how the web site is
> maintained so I can add myself? Is it just the ASF CMS?
>
> On Sat, Jan 25, 2014 at 4:40 PM, Michael McCandless
> <luc...@mikemccandless.com> wrote:
>> I'm pleased to announce that Benson Margulies has accepted to join our
>> ranks as a committer.
>>
>> Benson has been involved in a number of Lucene/Solr issues over time
>> (see http://jirasearch.mikemccandless.com/search.py?index=jira&chg=dds&a1=allUsers&a2=Benson+Margulies ),
>> most recently on debugging tricky analysis issues.
>>
>> Benson, it is tradition that you introduce yourself with a brief bio.
>> I know you're heavily involved in other Apache projects already...
>>
>> Once your account is set up, you should then be able to add yourself
>> to the Who We Are page on the website as well.
>>
>> Congratulations and welcome!
>>
>> Mike McCandless
>> http://blog.mikemccandless.com

--
Han Jiang
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879423#comment-13879423 ]

Han Jiang commented on LUCENE-3069:
-----------------------------------

Thanks for catching this Mike! I wasn't quick to get that username :p

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
Re: Welcome Areek Zillur as Lucene/Solr committer!
Congratulations and welcome Areek!

On Wed, Jan 22, 2014 at 12:57 PM, Shalin Shekhar Mangar
<shalinman...@gmail.com> wrote:
> Welcome Areek!
>
> On Wed, Jan 22, 2014 at 2:00 AM, Areek Zillur <areek...@gmail.com> wrote:
>> Thanks Robert! I am very pleased to be a committer to Lucene/Solr!
>>
>> I am originally from Dhaka, Bangladesh. I am currently a 4th-year
>> Computer Engineering student at the University of Waterloo in Canada.
>> I was fortunate enough to have multiple internships all over North
>> America through the university's co-op program. I was first introduced
>> to Lucene/Solr in one of my work terms at A9 and loved it. I really
>> enjoy the open-source development and the friendliness of the
>> community behind Lucene/Solr.
>>
>> In my free time, I enjoy working on my recreational algorithmic
>> trading system and learning new programming languages. I hope to
>> continue to work on Lucene/Solr and learn a lot more from the
>> community!
>>
>> Thanks,
>> Areek Zillur
>>
>> On Tue, Jan 21, 2014 at 11:41 AM, Yonik Seeley <yo...@heliosearch.com> wrote:
>>> Welcome Areek!
>>>
>>> -Yonik
>>> http://heliosearch.com -- making solr shine
>>>
>>> On Tue, Jan 21, 2014 at 2:26 PM, Robert Muir <rcm...@gmail.com> wrote:
>>>> I'm pleased to announce that Areek Zillur has accepted to join our
>>>> ranks as a committer.
>>>>
>>>> Areek has been improving suggester support in Lucene and Solr,
>>>> including a revamped Solr component slated for the 4.7 release. [1]
>>>>
>>>> Areek, it is tradition that you introduce yourself with a brief bio.
>>>>
>>>> Once your account is set up, you should then be able to add yourself
>>>> to the Who We Are page on the website as well.
>>>>
>>>> Congratulations and welcome!
>>>>
>>>> [1] https://issues.apache.org/jira/browse/SOLR-5378
>
> --
> Regards,
> Shalin Shekhar Mangar.

--
Han Jiang
[jira] [Commented] (LUCENE-5376) Add a demo search server
[ https://issues.apache.org/jira/browse/LUCENE-5376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855634#comment-13855634 ]

Han Jiang commented on LUCENE-5376:
-----------------------------------

+1, it will be great to have an 'active' demo to show the features :)

I think we should remove those hardcoded classpaths, e.g. in post.py:30?
And will this demo be expected to be the same as jirasearch? Will we need
further configuration to get the demo website working? For example, I
cannot find search.py in the source code.

> Add a demo search server
> ------------------------
>
>                 Key: LUCENE-5376
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5376
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: lucene-demo-server.tgz
>
> I think it'd be useful to have a demo search server for Lucene. Rather
> than being fully featured, like Solr, it would be minimal, just wrapping
> the existing Lucene modules to show how you can make use of these
> features in a server setting.
>
> The purpose is to demonstrate how one can build a minimal search server
> on top of APIs like SearcherManager, SearcherLifetimeManager, etc. This
> is also useful for finding rough edges / issues in Lucene's APIs that
> make building a server unnecessarily hard.
>
> I don't think it should have back-compatibility promises (except
> Lucene's index back compatibility), so it's free to improve as Lucene's
> APIs change.
>
> As a starting point, I'll post what I built for the "eating your own dog
> food" search app for Lucene's & Solr's jira issues,
> http://jirasearch.mikemccandless.com (blog:
> http://blog.mikemccandless.com/2013/05/eating-dog-food-with-lucene.html ).
> It uses Netty to expose basic indexing & searching APIs via JSON, but
> it's very rough (lots of nocommits).

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
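The shape of the demo server's API — JSON requests for indexing and searching against an in-memory service — can be sketched in miniature. This is a hypothetical Python sketch of the request/response pattern only; the operation names, fields, and the naive substring match are illustrative and are not the actual demo's endpoints or Lucene's query handling.

```python
import json


class MiniSearchService:
    """Toy JSON indexing/search service: one handler takes a JSON request
    string and returns a JSON response string, as a network server would."""

    def __init__(self):
        self.docs = {}  # doc id -> body text

    def handle(self, request_json):
        req = json.loads(request_json)
        if req["op"] == "index":
            self.docs[req["id"]] = req["body"]
            return json.dumps({"status": "ok"})
        if req["op"] == "search":
            # Naive substring match stands in for real query parsing/scoring.
            hits = sorted(doc_id for doc_id, body in self.docs.items()
                          if req["q"] in body)
            return json.dumps({"hits": hits})
        return json.dumps({"error": "unknown op"})
```

In the real demo the same pattern sits behind Netty, and the interesting parts are which Lucene APIs (SearcherManager, SearcherLifetimeManager) back the handler.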
Re: svn commit: r1548830 - /lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/store/SimpleFSLockFactory.java
OK! Thanks for the fix!

On Sat, Dec 7, 2013 at 7:21 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> Hi Han,
>
> I committed an even better fix for this, using the native javadocs
> linking instead of an HTML link. By that it's automatically also correct
> in branch_4x (where the docs have to go to Java 6 SE javadocs).
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> -----Original Message-----
> From: h...@apache.org [mailto:h...@apache.org]
> Sent: Saturday, December 07, 2013 11:34 AM
> To: comm...@lucene.apache.org
> Subject: svn commit: r1548830 - /lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/store/SimpleFSLockFactory.java
>
> Author: han
> Date: Sat Dec 7 10:34:21 2013
> New Revision: 1548830
>
> URL: http://svn.apache.org/r1548830
> Log: broken link police
>
> Modified:
>     lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/store/SimpleFSLockFactory.java
>
> URL: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/store/SimpleFSLockFactory.java?rev=1548830&r1=1548829&r2=1548830&view=diff
> ==============================================================================
> --- lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/store/SimpleFSLockFactory.java (original)
> +++ lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/store/SimpleFSLockFactory.java Sat Dec 7 10:34:21 2013
> @@ -25,7 +25,7 @@ import java.io.IOException;
>   * File#createNewFile()}.</p>
>   *
>   * <p><b>NOTE:</b> the <a target="_top"
> - * href="http://java.sun.com/j2se/1.4.2/docs/api/java/io/File.html#createNewFile()">javadocs
> + * href="http://docs.oracle.com/javase/7/docs/api/java/io/File.html#createNewFile()">javadocs
>   * for <code>File.createNewFile</code></a> contain a vague
>   * yet spooky warning about not using the API for file
>   * locking.
>   This warning was added due to <a target="_top" ...

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

--
Han Jiang
Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China
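For context, the javadoc being edited above documents SimpleFSLockFactory's locking scheme, which acquires a lock by atomically creating a file and fails if the file already exists. A rough Python analogue of that create-file handshake (illustrative only; Lucene's implementation is in Java and handles more failure modes than this sketch):

```python
import os


def try_acquire_lock(path):
    """Atomically create a lock file; return True if we got the lock.
    O_CREAT | O_EXCL makes creation fail if the file already exists,
    mirroring the File.createNewFile() contract the javadoc links to."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False


def release_lock(path):
    """Release by removing the lock file."""
    os.remove(path)
```

The "spooky warning" in the linked javadoc is precisely about this pattern: on some filesystems (notably NFS) the atomicity of create-if-absent is not guaranteed, which is why Lucene also offers other LockFactory implementations.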
Re: [VOTE] Lucene / Solr 4.6.0
Wow, congratulations Uwe! On Fri, Nov 15, 2013 at 4:11 AM, Uwe Schindler u...@thetaphi.de wrote: The PMC Chair is going to marry tomorrow... Simon has to come here and not do new RCs! :) In any case, thanks for doing the release, Simon. I will do the next! Uwe Simon Willnauer simon.willna...@gmail.com schrieb: Thanks Steve I won't get to this until next week. I will upload a new RC on monday. Simon Sent from my iPhone On 14 Nov 2013, at 20:20, Steve Rowe sar...@gmail.com wrote: I’ve committed fixes, to lucene_solr_4_6 as well as to branch_4x and trunk, for all the problems I mentioned. The first revision including all these is 1542030. Steve On Nov 14, 2013, at 1:16 PM, Steve Rowe sar...@gmail.com wrote: -1 Smoke tester passes. Solr Changes look good, except that the “Upgrading from Solr 4.5.0” section” follows “Detailed Change List”, but should be above it; and one change attribution didn’t get recognized because it’s missing parens: Elran Dvir via Erick Erickson. Definitely not worth a respin in either case. Lucene Changes look good, except that the “API Changes” section in Changes.html is formatted as an item in the “Bug Fixes” section, rather than its own section. I’ll fix. (The issue is that “API Changes:” in CHANGES.txt has a trailing colon - the section name regex should allow this. ) This is probably not worth a respin. Lucene and Solr Documentation pages look good, except that the File Formats” link from the Lucene Documentation page leads to the 4.5 format doc, rather than the 4.6 format doc (Lucene46Codec was introduced by LUCENE-5215). This is respin-worthy. Updating this is not automated now - it’s hard-coded in lucene/site/xsl/index.xsl - the default codec doesn’t change in every release. I’ll try to automate extracting the default from o.a.l.codecs.Codec#defaultCodec [ = Codec.forName(“Lucene46”)]. Lucene and Solr Javadocs look good. 
Steve On Nov 14, 2013, at 4:37 AM, Simon Willnauer simon.willna...@gmail.com wrote: Please vote for the first Release Candidate for Lucene/Solr 4.6.0 you can download it here: http://people.apache.org/~simonw/staging_area/lucene-solr-4.6.0-RC1-rev1541686 or run the smoke tester directly with this commandline (don't forget to set JAVA6_HOME etc.): python3.2 -u dev-tools/scripts/smokeTestRelease.py http://people.apache.org/~simonw/staging_area/lucene-solr-4.6.0-RC1-rev1541686 1541686 4.6.0 /tmp/smoke_test_4_6 I integrated the RC into Elasticsearch and all tests pass: https://github.com/s1monw/elasticsearch/commit/765e3194bb23f202725bfb28d9a2fd7cc71b49de Smoketester said: SUCCESS! [1:15:57.339272] here is my +1 Simon -- Uwe Schindler H.-H.-Meier-Allee 63, 28213 Bremen http://www.thetaphi.de -- Han Jiang Team of Search Engine and Web Mining, School of Electronic Engineering and Computer Science, Peking University, China
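[Editor's note] Steve's point about the CHANGES.txt section-name regex (a trailing colon in “API Changes:” keeps the heading from being recognized) can be illustrated with a small sketch. This is a hypothetical pattern for illustration only, not the actual regex from Lucene's changes-to-HTML tooling:

```python
import re

# Hypothetical section-heading matcher: accept an optional trailing colon
# so "API Changes:" is treated the same as "API Changes".
SECTION_RE = re.compile(r'^(?P<name>[A-Z][\w /]+?):?\s*$')

def section_name(line):
    """Return the normalized section name, or None if the line is not a heading."""
    m = SECTION_RE.match(line)
    return m.group('name') if m else None

print(section_name("API Changes"))   # heading without colon
print(section_name("API Changes:"))  # heading with stray trailing colon
```

With the `:?` in the pattern, both spellings normalize to the same section name, which is exactly the tolerance Steve suggests adding.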
Re: Welcome Ryan Ernst as Lucene/Solr committer
Welcome, Ryan! On Tue, Oct 15, 2013 at 2:13 AM, Ryan Ernst r...@iernst.net wrote: Thanks Adrian. I grew up in Bakersfield, CA (colloquially known as the armpit of California). I escaped and went to Cal Poly for my bachelors in computer science, and after a very brief stint working on HPUX, I landed working on the Amazon search engine for A9. I especially enjoy working with compression and encodings, and hope to experiment there some more with Lucene. Thanks Ryan On Mon, Oct 14, 2013 at 10:27 AM, Adrien Grand jpou...@gmail.com wrote: I'm pleased to announce that Ryan Ernst has accepted to join our ranks as a committer. Ryan has been working on a number of Lucene and Solr issues and recently contributed the new expressions module[1] which allows for compiling javascript expressions into SortField instances with excellent performance since it doesn't rely on a scripting engine but directly generates Java bytecode. This is a very exciting change which will be available in Lucene 4.6. Ryan, it is tradition that you introduce yourself with a brief bio. Congratulations and welcome! [1] https://issues.apache.org/jira/browse/LUCENE-5207 -- Adrien - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Han Jiang Team of Search Engine and Web Mining, School of Electronic Engineering and Computer Science, Peking University, China
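[Editor's note] The expressions module Adrien mentions gets its speed by compiling an expression once (to Java bytecode) rather than re-interpreting it per document. A Python toy, not the Lucene API, sketching that compile-once idea:

```python
from math import sqrt, log

# Toy sketch: compile the scoring expression once up front, then evaluate
# the precompiled code object against each document's values. (Lucene's
# expressions module instead generates Java bytecode for a SortField.)
code = compile("sqrt(score) + log(popularity)", "<expr>", "eval")

def score_doc(score, popularity):
    # evaluate the precompiled expression for one document
    return eval(code, {"sqrt": sqrt, "log": log},
                {"score": score, "popularity": popularity})

print(score_doc(4.0, 1.0))  # sqrt(4) + log(1) = 2.0
```

The design point is the same in both worlds: parsing happens once, so the per-document cost is just the evaluation itself.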
[jira] [Commented] (LUCENE-5268) Cutover more postings formats to the inverted pull API
[ https://issues.apache.org/jira/browse/LUCENE-5268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13791278#comment-13791278 ] Han Jiang commented on LUCENE-5268: --- +1, the pulsing code is much cleaner! Cutover more postings formats to the inverted pull API Key: LUCENE-5268 URL: https://issues.apache.org/jira/browse/LUCENE-5268 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 5.0 Attachments: LUCENE-5268.patch In LUCENE-5123, we added a new, more flexible, pull API for writing postings. This API allows the postings format to iterate the fields/terms/postings more than once, and mirrors the API for writing doc values. But that was just the first step (only SimpleText was cutover to the new API). I want to cutover more components, so we can (finally) e.g. play with different encodings depending on the term's postings, such as using a bitset for high freq DOCS_ONLY terms (LUCENE-5052). -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
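[Editor's note] The motivation in the issue, using a bitset for high-frequency DOCS_ONLY terms, comes down to a size tradeoff. A toy sketch (not Lucene code; byte counts are illustrative) comparing a delta-coded vInt doc-ID list against a bitset over maxDoc:

```python
def vint_size(n):
    # bytes needed to store n as a variable-length integer (7 bits per byte)
    size = 1
    while n >= 0x80:
        n >>= 7
        size += 1
    return size

def delta_list_bytes(doc_ids):
    # classic postings encoding: store the gap from the previous doc ID
    prev, total = 0, 0
    for d in doc_ids:
        total += vint_size(d - prev)
        prev = d
    return total

def bitset_bytes(max_doc):
    # one bit per document in the segment
    return (max_doc + 7) // 8

max_doc = 1000
dense = list(range(0, max_doc, 2))  # term occurs in every other doc
print(delta_list_bytes(dense), bitset_bytes(max_doc))  # 500 vs 125
```

For a term this dense, the bitset is 4x smaller (and supports fast advance); for rare terms the gap list wins. A pull API lets the postings format see the term's statistics before choosing an encoding.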
Re: svn commit: r1530537 - in /lucene/dev/trunk/lucene: common-build.xml ivy-settings.xml
oh, yes, I'll do that! On Wed, Oct 9, 2013 at 5:17 PM, Robert Muir rcm...@gmail.com wrote: Thanks for updating this! I think we should merge this back to branch 4.x too? This way the source code tar.gz is working from China for our next release? 2013/10/9 h...@apache.org: Author: han Date: Wed Oct 9 08:56:15 2013 New Revision: 1530537 URL: http://svn.apache.org/r1530537 Log: update broken links for maven mirror Modified: lucene/dev/trunk/lucene/common-build.xml lucene/dev/trunk/lucene/ivy-settings.xml Modified: lucene/dev/trunk/lucene/common-build.xml URL: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/common-build.xml?rev=1530537r1=1530536r2=1530537view=diff == --- lucene/dev/trunk/lucene/common-build.xml (original) +++ lucene/dev/trunk/lucene/common-build.xml Wed Oct 9 08:56:15 2013 @@ -360,7 +360,7 @@ property name=ivy_install_path location=${user.home}/.ant/lib / property name=ivy_bootstrap_url1 value= http://repo1.maven.org/maven2/ !-- you might need to tweak this from china so it works -- - property name=ivy_bootstrap_url2 value= http://mirror.netcologne.de/maven2/ + property name=ivy_bootstrap_url2 value=http://uk.maven.org/maven2 / property name=ivy_checksum_sha1 value=c5ebf1c253ad4959a29f4acfe696ee48cdd9f473/ target name=ivy-availability-check unless=ivy.available Modified: lucene/dev/trunk/lucene/ivy-settings.xml URL: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/ivy-settings.xml?rev=1530537r1=1530536r2=1530537view=diff == --- lucene/dev/trunk/lucene/ivy-settings.xml (original) +++ lucene/dev/trunk/lucene/ivy-settings.xml Wed Oct 9 08:56:15 2013 @@ -35,7 +35,7 @@ ibiblio name=maven.restlet.org root=http://maven.restlet.org; m2compatible=true / !-- you might need to tweak this from china so it works -- -ibiblio name=working-chinese-mirror root= http://mirror.netcologne.de/maven2; m2compatible=true / +ibiblio name=working-chinese-mirror root= http://uk.maven.org/maven2; m2compatible=true / !-- temporary to try Clover 3.2.0 snapshots, see 
https://issues.apache.org/jira/browse/LUCENE-5243, https://jira.atlassian.com/browse/CLOV-1368 -- ibiblio name=atlassian-clover-snapshots root= https://maven.atlassian.com/content/repositories/atlassian-public-snapshot; m2compatible=true / - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Welcome Joel Bernstein
Welcome Joel! On Thu, Oct 3, 2013 at 1:24 PM, Grant Ingersoll gsing...@apache.org wrote: Hi, The Lucene PMC is happy to welcome Joel Bernstein as a committer on the Lucene and Solr project. Joel has been working on a number of issues on the project and we look forward to his continued contributions going forward. Welcome aboard, Joel! -Grant - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Han Jiang Team of Search Engine and Web Mining, School of Electronic Engineering and Computer Science, Peking University, China
Re: Core trunk compile-test fails on rev 1527154, TestNumericDocValuesUpdates, Lucene45RWCodec
Hi Paul, Just an FYI, it cannot reproduce on my machine. Maybe... you need 'ant clean' ? On Sun, Sep 29, 2013 at 9:02 PM, Paul Elschot paul.j.elsc...@gmail.comwrote: Dear readers, When I update my working copy of lucene core trunk to current latest rev 1527154, ant compile-test fails with this message: ... lucene/trunk/lucene/core/src/**test/org/apache/lucene/index/** TestNumericDocValuesUpdates.**java:17: error: cannot find symbol [javac] import org.apache.lucene.codecs.**lucene45.Lucene45RWCodec; After updating (backdating) to the 26th: svn update -r {20130926} ant compile-test works normally. I couldn't decide on which issue to post this, so here it is. Regards, Paul Elschot --**--**- To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.**orgdev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Han Jiang Team of Search Engine and Web Mining, School of Electronic Engineering and Computer Science, Peking University, China
Re: Core trunk compile-test fails on rev 1527154, TestNumericDocValuesUpdates, Lucene45RWCodec
Paul, a quick hack (e.g. for your development), is to run the ant command in /lucene instead of /lucene/core, I don't know why that fails, and hope someone can explain this, though :) On Sun, Sep 29, 2013 at 9:54 PM, Paul Elschot paul.j.elsc...@gmail.comwrote: Hi Han, I just reproduced it three times on my working copy in the directory trunk/lucene/core with ant clean in the command sequence: svn update -r {20130926} ant clean ant compile-test # build successful svn update -r 1527154 ant clean ant compile-test # build failed In my working copy svn status currently produces this: M src/java/org/apache/lucene/util/packed/EliasFanoDecoder.java M src/java/org/apache/lucene/util/packed/EliasFanoDocIdSet.java M src/java/org/apache/lucene/util/packed/EliasFanoEncoder.java M src/test/org/apache/lucene/util/packed/TestEliasFanoSequence.java and I don't expect these have an influence. To be complete, on revision 1526316 (of the 26th) I also got this output (slightly edited) once: compile-core: [mkdir] Created dir: lucene/trunk/lucene/build/core/classes/java [javac] Compiling 672 source files to ... lucene/trunk/lucene/build/core/classes/java [javac] An exception has occurred in the compiler (1.7.0_21). Please file a bug at the Java Developer Connection ( http://java.sun.com/webapps/bugreport) after checking the Bug Parade for duplicates. Include your program and the following diagnostic in your report. Thank you. [javac] java.lang.AbstractMethodError I could not reproduce that one with four more attempts, so I do hope that that was a one time glitch. But it is strange that on my machine: javac -version produces: javac 1.7.0_40 and the compiler exception message above reports 1.7.0_21. Perhaps there is something wrong with my java/javac setup, any advice there? Regards, Paul Elschot P.S. 
java -version produces: java version 1.7.0_40 Java(TM) SE Runtime Environment (build 1.7.0_40-b43) Java HotSpot(TM) Server VM (build 24.0-b56, mixed mode) On 29-09-13 15:11, Han Jiang wrote: Hi Paul, Just an FYI, it cannot reproduce on my machine. Maybe... you need 'ant clean' ? On Sun, Sep 29, 2013 at 9:02 PM, Paul Elschot paul.j.elsc...@gmail.comwrote: Dear readers, When I update my working copy of lucene core trunk to current latest rev 1527154, ant compile-test fails with this message: ... lucene/trunk/lucene/core/src/test/org/apache/lucene/index/TestNumericDocValuesUpdates.java:17: error: cannot find symbol [javac] import org.apache.lucene.codecs.lucene45.Lucene45RWCodec; After updating (backdating) to the 26th: svn update -r {20130926} ant compile-test works normally. I couldn't decide on which issue to post this, so here it is. Regards, Paul Elschot - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Han Jiang Team of Search Engine and Web Mining, School of Electronic Engineering and Computer Science, Peking University, China -- Han Jiang Team of Search Engine and Web Mining, School of Electronic Engineering and Computer Science, Peking University, China
Re: Core trunk compile-test fails on rev 1527154, TestNumericDocValuesUpdates, Lucene45RWCodec
Shai, I think we should change TestRulSetupAndRestoreClassEnv, I'll upload a patch for this. On Sun, Sep 29, 2013 at 10:18 PM, Shai Erera ser...@gmail.com wrote: I'm not sure but maybe it is related to the fact you run it from lucene/core. Since on LUCENE-5215 (rev 1527154) I created a new Lucene46Codec, and moved Lucene45 stuff under test-framework, as well as changed Lucene45Codec.fieldInfosFormat to not be final, perhaps you need to run 'ant clean' from the root to make sure all changes are compiled accordingly? Out of curiosity, did you 'svn up' from root, or perhaps from lucene/core by accident? Shai On Sun, Sep 29, 2013 at 4:54 PM, Paul Elschot paul.j.elsc...@gmail.comwrote: Hi Han, I just reproduced it three times on my working copy in the directory trunk/lucene/core with ant clean in the command sequence: svn update -r {20130926} ant clean ant compile-test # build successful svn update -r 1527154 ant clean ant compile-test # build failed In my working copy svn status currently produces this: M src/java/org/apache/lucene/util/packed/EliasFanoDecoder.java M src/java/org/apache/lucene/util/packed/EliasFanoDocIdSet.java M src/java/org/apache/lucene/util/packed/EliasFanoEncoder.java M src/test/org/apache/lucene/util/packed/TestEliasFanoSequence.java and I don't expect these have an influence. To be complete, on revision 1526316 (of the 26th) I also got this output (slightly edited) once: compile-core: [mkdir] Created dir: lucene/trunk/lucene/build/core/classes/java [javac] Compiling 672 source files to ... lucene/trunk/lucene/build/core/classes/java [javac] An exception has occurred in the compiler (1.7.0_21). Please file a bug at the Java Developer Connection ( http://java.sun.com/webapps/bugreport) after checking the Bug Parade for duplicates. Include your program and the following diagnostic in your report. Thank you. [javac] java.lang.AbstractMethodError I could not reproduce that one with four more attempts, so I do hope that that was a one time glitch. 
But it is strange that on my machine: javac -version produces: javac 1.7.0_40 and the compiler exception message above reports 1.7.0_21. Perhaps there is something wrong with my java/javac setup, any advice there? Regards, Paul Elschot P.S. java -version produces: java version 1.7.0_40 Java(TM) SE Runtime Environment (build 1.7.0_40-b43) Java HotSpot(TM) Server VM (build 24.0-b56, mixed mode) On 29-09-13 15:11, Han Jiang wrote: Hi Paul, Just an FYI, it cannot reproduce on my machine. Maybe... you need 'ant clean' ? On Sun, Sep 29, 2013 at 9:02 PM, Paul Elschot paul.j.elsc...@gmail.comwrote: Dear readers, When I update my working copy of lucene core trunk to current latest rev 1527154, ant compile-test fails with this message: ... lucene/trunk/lucene/core/src/test/org/apache/lucene/index/TestNumericDocValuesUpdates.java:17: error: cannot find symbol [javac] import org.apache.lucene.codecs.lucene45.Lucene45RWCodec; After updating (backdating) to the 26th: svn update -r {20130926} ant compile-test works normally. I couldn't decide on which issue to post this, so here it is. Regards, Paul Elschot - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Han Jiang Team of Search Engine and Web Mining, School of Electronic Engineering and Computer Science, Peking University, China -- Han Jiang Team of Search Engine and Web Mining, School of Electronic Engineering and Computer Science, Peking University, China
[jira] [Updated] (LUCENE-5215) Add support for FieldInfos generation
[ https://issues.apache.org/jira/browse/LUCENE-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Han Jiang updated LUCENE-5215: -- Attachment: LUCENE-5215.patch Patch for the compile error mentioned by Paul. Add support for FieldInfos generation - Key: LUCENE-5215 URL: https://issues.apache.org/jira/browse/LUCENE-5215 Project: Lucene - Core Issue Type: New Feature Components: core/index Reporter: Shai Erera Assignee: Shai Erera Attachments: LUCENE-5215.patch, LUCENE-5215.patch, LUCENE-5215.patch, LUCENE-5215.patch, LUCENE-5215.patch, LUCENE-5215.patch, LUCENE-5215.patch, LUCENE-5215.patch In LUCENE-5189 we've identified few reasons to do that: # If you want to update docs' values of field 'foo', where 'foo' exists in the index, but not in a specific segment (sparse DV), we cannot allow that and have to throw a late UOE. If we could rewrite FieldInfos (with generation), this would be possible since we'd also write a new generation of FIS. # When we apply NDV updates, we call DVF.fieldsConsumer. Currently the consumer isn't allowed to change FI.attributes because we cannot modify the existing FIS. This is implicit however, and we silently ignore any modified attributes. FieldInfos.gen will allow that too. The idea is to add to SIPC fieldInfosGen, add to each FieldInfo a dvGen and add support for FIS generation in FieldInfosFormat, SegReader etc., like we now do for DocValues. I'll work on a patch. Also on LUCENE-5189, Rob raised a concern about SegmentInfo.attributes that have same limitation -- if a Codec modifies them, they are silently being ignored, since we don't gen the .si files. I think we can easily solve that by recording SI.attributes in SegmentInfos, so they are recorded per-commit. But I think it should be handled in a separate issue. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-5215) Add support for FieldInfos generation
[ https://issues.apache.org/jira/browse/LUCENE-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Han Jiang updated LUCENE-5215: -- Attachment: LUCENE-5215.patch -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (LUCENE-5215) Add support for FieldInfos generation
[ https://issues.apache.org/jira/browse/LUCENE-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Han Jiang updated LUCENE-5215: -- Attachment: (was: LUCENE-5215.patch) -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (LUCENE-5215) Add support for FieldInfos generation
[ https://issues.apache.org/jira/browse/LUCENE-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13781405#comment-13781405 ] Han Jiang commented on LUCENE-5215: --- I guess so :) -- This message was sent by Atlassian JIRA (v6.1#6144)
Re: Welcome back, Wolfgang Hoschek!
Welcome back Wolfgang! On Fri, Sep 27, 2013 at 2:19 PM, Robert Muir rcm...@gmail.com wrote: Welcome back! On Thu, Sep 26, 2013 at 6:21 AM, Uwe Schindler uschind...@apache.org wrote: Hi, I'm pleased to announce that after a long abstinence, Wolfgang Hoschek rejoined the Lucene/Solr committer team. He is working now at Cloudera and plans to help with the integration of Solr and Hadoop. Wolfgang originally wrote the MemoryIndex, which is used by the classical Lucene highlighter and ElasticSearch's percolator module. Looking forward to new contributions. Welcome back heavy committing! :-) Uwe P.S.: Wolfgang, as soon as you have setup your subversion access, you should add yourself back to the committers list on the website as well. - Uwe Schindler uschind...@apache.org Apache Lucene PMC Chair / Committer Bremen, Germany http://lucene.apache.org/ - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Han Jiang Team of Search Engine and Web Mining, School of Electronic Engineering and Computer Science, Peking University, China
[jira] [Commented] (LUCENE-5123) invert the codec postings API
[ https://issues.apache.org/jira/browse/LUCENE-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13772862#comment-13772862 ] Han Jiang commented on LUCENE-5123: --- Nice change! Although PushFieldsConsumer is still using the old API, I like the migrating of flush() logic from FreqProxTermsWriterPerField to PushFieldsConsumer, the calling chain is more clear in codec level now. :) Also, I'm quite curious whether StoredFields and TermVectors will get rid of 'merge()' later. invert the codec postings API - Key: LUCENE-5123 URL: https://issues.apache.org/jira/browse/LUCENE-5123 Project: Lucene - Core Issue Type: Wish Reporter: Robert Muir Assignee: Michael McCandless Fix For: 5.0 Attachments: LUCENE-5123.patch, LUCENE-5123.patch, LUCENE-5123.patch, LUCENE-5123.patch, LUCENE-5123.patch Currently FieldsConsumer/PostingsConsumer/etc is a push oriented api, e.g. FreqProxTermsWriter streams the postings at flush, and the default merge() takes the incoming codec api and filters out deleted docs and pushes via same api (but that can be overridden). It could be cleaner if we allowed for a pull model instead (like DocValues). For example, maybe FreqProxTermsWriter could expose a Terms of itself and just passed this to the codec consumer. This would give the codec more flexibility to e.g. do multiple passes if it wanted to do things like encode high-frequency terms more efficiently with a bitset-like encoding or other things... A codec can try to do things like this to some extent today, but its very difficult (look at buffering in Pulsing). We made this change with DV and it made a lot of interesting optimizations easy to implement... -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Can we use TREC data set in open source?
I read here http://lemurproject.org/clueweb09/ that there is a hosted version of ClueWeb09 (the latest is ClueWeb12, for which I don't find a hosted version), and to get access to it, someone from the ASF will need to sign an Organizational Agreement with them as well as each individual in the project will need to sign an Individual Agreement (retained by the ASF). Perhaps this can be available only to committers. This is nice! I'll try to ask ASF about this. To this day, I think the only way it will happen is for the community to build a completely open system, perhaps based off of Common Crawl or our own crawl and host it ourselves and develop judgments, etc. Yeah, this is what we need in ORP. Most people like the idea, but are not sure how to distribute it in an open way (ClueWeb comes as 4 1TB disks right now) and I am also not sure how they would handle any copyright/redaction claims against it. There is, of course, little incentive for those involved to solve these, either, as most people who are interested sign the form and pay the $600 for the disks. Sigh, yes, it is hard to make a data set totally public. Actually, one of my purpose in this question is to see whether it is acceptable in our community (i.e. lucene/solr only) to obtain a data set not open to all people. When expand to a larger scope, the license issue is somewhat hairy... And since Shai has found a possible 'free' data set, I think it is possible for ASF to obtain an Organizational Agreement for this. I'll try to contact ASF CMU about how they define person with the authority in OSS. On Tue, Sep 17, 2013 at 6:11 AM, Grant Ingersoll gsing...@apache.orgwrote: Inline below On Sep 9, 2013, at 10:53 PM, Han Jiang jiangha...@gmail.com wrote: Back in 2007 Grant contacted with NIST about making TREC collection available to our community: http://mail-archives.apache.org/mod_mbox/lucene-dev/200708.mbox/browser I think a try for this is really important to our project and people who use Lucene. 
All these years the speed performance is mainly tuned on Wikipedia, however it's not very 'standard': * it doesn't represent how real-world search works; * it cannot be used to evaluate the relevance of our scoring models; * researchers tend to do experiments on other data sets, and usually it is hard to know whether Lucene performs its best performance; And personally I agree with this line: I think it would encourage Lucene users/developers to think about relevance as much as we think about speed. There's been much work to make Lucene's scoring models pluggable in 4.0, and it'll be great if we can explore more about it. It is very appealing to see a high-performance library work along with state-of-the-art ranking methods. And about TREC data set, the problems we met are: 1. NIST/TREC does not own the original collections, therefore it might be necessary to have direct contact with those organizations who really did, such as: http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html http://lemurproject.org/clueweb12/ 2. Currently, there is no open-source license for any of the data sets, so it won't be as 'open' as Wikipedia is. As is proposed by Grant, a possibility is to make the data set accessible only to committers instead of all users. It is not very open-source then, but TREC data sets is public and usually available to researchers, so people can still reproduce performance test. I'm quite curious, has anyone explored getting an open-source license for one of those data sets? And is our community still interested about this issue after all these years? It continues to be of interest to me. I've had various conversations throughout the years on it. Most people like the idea, but are not sure how to distribute it in an open way (ClueWeb comes as 4 1TB disks right now) and I am also not sure how they would handle any copyright/redaction claims against it. 
There is, of course, little incentive for those involved to solve these, either, as most people who are interested sign the form and pay the $600 for the disks. I've had a number of conversations about how I view this to be a significant barrier to open research, esp. in under-served countries and to open source. People sympathize with me, but then move on. To this day, I think the only way it will happen is for the community to build a completely open system, perhaps based off of Common Crawl or our own crawl and host it ourselves and develop judgments, etc. We tried to get this off the ground w/ the Open Relevance Project, but there was never a sustainable effort, and thus I have little hope at this point for it (but I would love to be proven wrong) For it to succeed, I think we would need the backing of a University with students interested in curating such a collection, the judgments, etc. I think we could figure out how to distribute the data either
Can we use TREC data set in open source?
Back in 2007 Grant contacted with NIST about making TREC collection available to our community: http://mail-archives.apache.org/mod_mbox/lucene-dev/200708.mbox/browser I think a try for this is really important to our project and people who use Lucene. All these years the speed performance is mainly tuned on Wikipedia, however it's not very 'standard': * it doesn't represent how real-world search works; * it cannot be used to evaluate the relevance of our scoring models; * researchers tend to do experiments on other data sets, and usually it is hard to know whether Lucene performs its best performance; And personally I agree with this line: I think it would encourage Lucene users/developers to think about relevance as much as we think about speed. There's been much work to make Lucene's scoring models pluggable in 4.0, and it'll be great if we can explore more about it. It is very appealing to see a high-performance library work along with state-of-the-art ranking methods. And about TREC data set, the problems we met are: 1. NIST/TREC does not own the original collections, therefore it might be necessary to have direct contact with those organizations who really did, such as: http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html http://lemurproject.org/clueweb12/ 2. Currently, there is no open-source license for any of the data sets, so it won't be as 'open' as Wikipedia is. As is proposed by Grant, a possibility is to make the data set accessible only to committers instead of all users. It is not very open-source then, but TREC data sets is public and usually available to researchers, so people can still reproduce performance test. I'm quite curious, has anyone explored getting an open-source license for one of those data sets? And is our community still interested about this issue after all these years? -- Han Jiang Team of Search Engine and Web Mining, School of Electronic Engineering and Computer Science, Peking University, China
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13760160#comment-13760160 ] Han Jiang commented on LUCENE-3069: --- Mike, thanks for the review!

bq. In general, couldn't the writer re-use the reader's TermState?

I'm afraid this would make the code somewhat longer? I'll make a patch to see.

{quote} Have you run first do no harm perf tests? Ie, compare current trunk w/ default Codec to branch w/ default Codec? Just to make sure there are no surprises... {quote}

Yes, no surprises yet.

bq. Why does Lucene41PostingsWriter have impersonation code?

Yeah, these should be removed.

{quote} I forget: why does the postings reader/writer need to handle delta coding again (take an absolute boolean argument)? Was it because of pulsing or sep? It's fine for now (progress not perfection) ... but not clean, since delta coding is really an encoding detail so in theory the terms dict should own that ... {quote}

Ah, yes, because of pulsing. This is because PulsingPostingsBase is more than a PostingsBaseFormat. It somewhat acts like a term dict, e.g. it needs to understand how terms are structured in one block (term No.1 uses an absolute value, term No.x uses a delta value) and then decide how to restructure the inlined and wrapped blocks (No.1 still uses an absolute value, but the first non-pulsed term needs absolute encoding as well). Without the 'absolute' argument, the real term dictionary would do the delta encoding itself, PulsingPostingsBase would be confused, and all wrapped PostingsBases would have to encode metadata values without delta encoding.

{quote} The new .smy file for Pulsing is sort of strange ... but necessary since it always uses 0 longs, so we have to store this somewhere ... you could put it into FieldInfo attributes instead? {quote}

Yeah, it is another hairy thing... the reason is, we don't have a 'PostingsTrailer' for PostingsBaseFormat. Pulsing will not know the longs size for each field until all the fields are consumed... and it should not write those longsSizes to termsOut in close(), since the term dictionary uses the DirTrailer hack there. (Maybe every term dictionary should close the postingsWriter first, then write the field summary and close itself? I'm not sure though.)

bq. Should we backport this to 4.x?

Yeah, OK!

Lucene should have an entirely memory resident term dictionary -- Key: LUCENE-3069 URL: https://issues.apache.org/jira/browse/LUCENE-3069 Project: Lucene - Core Issue Type: Improvement Components: core/index, core/search Affects Versions: 4.0-ALPHA Reporter: Simon Willnauer Assignee: Han Jiang Labels: gsoc2013 Fix For: 5.0, 4.5 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch FST based TermDictionary has been a great improvement yet it still uses a delta codec file for scanning to terms. Some environments have enough memory available to keep the entire FST based term dict in memory. We should add a TermDictionary implementation that encodes all needed information for each term into the FST (custom fst.Output) and builds a FST from the entire term not just the delta. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13760325#comment-13760325 ] Han Jiang commented on LUCENE-3069: --- I think this is ready to commit to trunk now, and I'll wait for a day or two before committing it. :)
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13757676#comment-13757676 ] Han Jiang commented on LUCENE-3069: --- OK! These two term dicts are both FST-based:
* the FST term dict uses the FST directly to map a term to its metadata (FSTTermData);
* the FSTOrd term dict uses the FST to map a term to its ordinal number (FSTLong), and the ordinal is then used to seek the metadata in another big chunk.

I prefer the second impl since it puts much less stress on the FST. I updated the detailed format explanation in the last commit. Hmm, I'll create another patch for this...
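The two-step lookup of the FSTOrd design can be sketched roughly like this (illustrative Java only; the class name is hypothetical and a sorted array with binary search stands in for the real FST<Long> term-to-ordinal mapping, which is not Lucene's API):

```java
import java.util.Arrays;

// Sketch of the FSTOrd idea: step 1 maps a term to its ordinal,
// step 2 uses the ordinal as a direct index into the metadata chunk.
class FSTOrdSketch {
    private final String[] sortedTerms;   // stand-in for the FST's sorted keys
    private final long[][] metadataChunk; // per-term metadata, addressed by ord

    FSTOrdSketch(String[] sortedTerms, long[][] metadataChunk) {
        this.sortedTerms = sortedTerms;
        this.metadataChunk = metadataChunk;
    }

    /** term -> ordinal (what the FST<Long> output would provide), or -1 if absent. */
    long ord(String term) {
        int i = Arrays.binarySearch(sortedTerms, term);
        return i >= 0 ? i : -1;
    }

    /** ordinal -> metadata: a simple indexed seek, so the FST only stores small ords. */
    long[] metadata(long ord) {
        return metadataChunk[(int) ord];
    }
}
```

The point of the split is that the FST only has to carry compact monotonic ordinals rather than full metadata payloads, which is why it puts much less stress on the FST.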
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13757771#comment-13757771 ] Han Jiang commented on LUCENE-3069: --- Yes, with slight changes, it can support seek by ord. (With FST.getByOutput).
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Han Jiang updated LUCENE-3069: -- Attachment: LUCENE-3069.patch

Patch from the last commit, and a summary:

Previously our term dictionaries were both block-based:
* the BlockTerms dict breaks the terms list into several blocks, as a linear structure with skip points;
* the BlockTreeTerms dict uses a trie-like structure to decide how terms are assigned to different blocks, and uses an FST index to optimize seek performance.

However, neither term dictionary holds all the term data in memory. In the worst case there are at least two seeks: one against the index in memory, another against the file on disk. And we already have many complicated optimizations for this... If by design a term dictionary is memory resident, the data structure becomes simpler (after all, we don't need to maintain extra file pointers for a second seek, and we don't have to pick heuristics for how terms are clustered). This is why the two FST-based implementations are introduced.

Another big change in the code: since our term dictionaries were both block-based, the previous API was also limited. The postings writer collected term metadata, and the term dictionary told the postings writer the range of terms to flush to a block. However, the encoding of term data should be decided by the term dictionary, since the postings writer doesn't always know how terms are structured in the term dictionary... The previous API had some tricky code for this, e.g. PulsingPostingsWriter had to use a term's ordinal within a block to decide how to write metadata, which is unnecessary. To make the API between term dict and postings list more 'pluggable' and 'general', I refactored PostingsReader/WriterBase. For example, the postings writer now tells the term dictionary how many of its metadata values are strictly monotonic, so that the term dictionary can optimize the delta-encoding itself. And since the term dictionary now fully decides how metadata are written, it gains the ability to use intblock-based metadata encoding.

Now the two term dictionary implementations can easily be plugged into current postings formats, like:
* FST41 = FSTTermdict + Lucene41PostingsBaseFormat
* FSTOrd41 = FSTOrdTermdict + Lucene41PostingsBaseFormat
* FSTOrdPulsing41 = FSTOrdTermsdict + PulsingPostingsWrapper + Lucene41PostingsFormat

About performance: as shown before, the two term dicts improve primary-key lookup, but still have overhead on wildcard queries (both term dicts keep only prefix information, which the term dictionary cannot yet exploit well...). I'll try to hack on this later.
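The "strictly monotonic metadata" idea mentioned above can be sketched like this (hypothetical names, not the actual Lucene API): the postings writer reports how many leading values of each term's long[] metadata are strictly increasing (file pointers, for instance), and the term dictionary delta-encodes exactly that prefix across consecutive terms, leaving the remaining values absolute:

```java
// Sketch: delta-encode the first 'longsSize' values of each term's metadata
// against the previous term; values past that prefix are copied unchanged.
class MetadataDeltaSketch {
    static long[][] deltaEncode(long[][] perTermMetadata, int longsSize) {
        long[] last = new long[longsSize]; // previous term's monotonic prefix
        long[][] out = new long[perTermMetadata.length][];
        for (int t = 0; t < perTermMetadata.length; t++) {
            out[t] = perTermMetadata[t].clone();
            for (int i = 0; i < longsSize; i++) {
                out[t][i] = perTermMetadata[t][i] - last[i]; // small deltas compress well
                last[i] = perTermMetadata[t][i];
            }
        }
        return out;
    }
}
```

Because the values are strictly monotonic, the deltas are always positive and typically small, which is what makes them cheap to store as VLongs.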
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Han Jiang updated LUCENE-3069: -- Attachment: LUCENE-3069.patch The uploaded patch should show all the changes against trunk: I added two different implementations of term dict, and refactored the PostingsBaseFormat to plug in non-block based term dicts. I'm still working on the javadocs, and maybe we should rename that 'temp' package, like 'fstterms'?
[jira] [Commented] (LUCENE-5199) Improve LuceneTestCase.defaultCodecSupportsDocsWithField to check the actual DocValuesFormat used per-field
[ https://issues.apache.org/jira/browse/LUCENE-5199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13756670#comment-13756670 ] Han Jiang commented on LUCENE-5199: --- Thanks Shai! Improve LuceneTestCase.defaultCodecSupportsDocsWithField to check the actual DocValuesFormat used per-field --- Key: LUCENE-5199 URL: https://issues.apache.org/jira/browse/LUCENE-5199 Project: Lucene - Core Issue Type: Improvement Components: general/test Reporter: Shai Erera Assignee: Shai Erera Fix For: 5.0, 4.5 Attachments: LUENE-5199.patch On LUCENE-5178 Han reported the following test failure: {noformat} [junit4] FAILURE 0.27s | TestRangeAccumulator.testMissingValues [junit4] Throwable #1: org.junit.ComparisonFailure: expected:...(0) [junit4] less than 10 ([8) [junit4] less than or equal to 10 (]8) [junit4] over 90 (8) [junit4] 9... but was:...(0) [junit4] less than 10 ([28) [junit4] less than or equal to 10 (2]8) [junit4] over 90 (8) [junit4] 9... [junit4] at __randomizedtesting.SeedInfo.seed([815B6AA86D05329C:EBC638EE498F066D]:0) [junit4] at org.apache.lucene.facet.range.TestRangeAccumulator.testMissingValues(TestRangeAccumulator.java:670) [junit4] at java.lang.Thread.run(Thread.java:722) {noformat} which can be reproduced with {noformat} tcase=TestRangeAccumulator -Dtests.method=testMissingValues -Dtests.seed=815B6AA86D05329C -Dtests.slow=true -Dtests.postingsformat=Lucene41 -Dtests.locale=ca -Dtests.timezone=Australia/Currie -Dtests.file.encoding=UTF-8 {noformat} It seems that the Codec that is picked is a Lucene45Codec with Lucene42DVFormat, which does not support docsWithFields for numericDV. We should improve LTC.defaultCodecSupportsDocsWithField to take a list of fields and check that the actual DVF used for each field supports it. -- This message is automatically generated by JIRA. 
[jira] [Commented] (LUCENE-5199) Improve LuceneTestCase.defaultCodecSupportsDocsWithField to check the actual DocValuesFormat used per-field
[ https://issues.apache.org/jira/browse/LUCENE-5199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13756766#comment-13756766 ] Han Jiang commented on LUCENE-5199: --- Thanks Rob! Yeah, I just hit another failure around TestSortDocValues. :)
[jira] [Commented] (LUCENE-5178) doc values should expose missing values (or allow configurable defaults)
[ https://issues.apache.org/jira/browse/LUCENE-5178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13756313#comment-13756313 ] Han Jiang commented on LUCENE-5178: --- During test I somehow hit a failure: {noformat} [junit4] FAILURE 0.27s | TestRangeAccumulator.testMissingValues [junit4] Throwable #1: org.junit.ComparisonFailure: expected:...(0) [junit4] less than 10 ([8) [junit4] less than or equal to 10 (]8) [junit4] over 90 (8) [junit4] 9... but was:...(0) [junit4] less than 10 ([28) [junit4] less than or equal to 10 (2]8) [junit4] over 90 (8) [junit4] 9... [junit4]at __randomizedtesting.SeedInfo.seed([815B6AA86D05329C:EBC638EE498F066D]:0) [junit4]at org.apache.lucene.facet.range.TestRangeAccumulator.testMissingValues(TestRangeAccumulator.java:670) [junit4]at java.lang.Thread.run(Thread.java:722) {noformat} Seed: {noformat} ant test -Dtestcase=TestRangeAccumulator -Dtests.method=testMissingValues -Dtests.seed=815B6AA86D05329C -Dtests.slow=true -Dtests.postingsformat=Lucene41 -Dtests.locale=ca -Dtests.timezone=Australia/Currie -Dtests.file.encoding=UTF-8 {noformat} doc values should expose missing values (or allow configurable defaults) Key: LUCENE-5178 URL: https://issues.apache.org/jira/browse/LUCENE-5178 Project: Lucene - Core Issue Type: Improvement Reporter: Yonik Seeley Fix For: 5.0, 4.5 Attachments: LUCENE-5178.patch, LUCENE-5178_reintegrate.patch DocValues should somehow allow a configurable default per-field. Possible implementations include setting it on the field in the document or registration of an IndexWriter callback. If we don't make the default configurable, then another option is to have DocValues fields keep track of whether a value was indexed for that document or not. -- This message is automatically generated by JIRA. 
[jira] [Commented] (LUCENE-5194) TestBackwardsCompatibility should not test Pulsing41
[ https://issues.apache.org/jira/browse/LUCENE-5194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13754307#comment-13754307 ] Han Jiang commented on LUCENE-5194: --- Thanks Mike! TestBackwardsCompatibility should not test Pulsing41 Key: LUCENE-5194 URL: https://issues.apache.org/jira/browse/LUCENE-5194 Project: Lucene - Core Issue Type: Bug Reporter: Michael McCandless Fix For: 5.0, 4.5 Spinoff from LUCENE-3069, where Billy discovered this ... For some reason it's currently testing a Pulsing41 index (at least index.41.cfs.zip), but we do not guarantee back compat for PulsingPF.
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Han Jiang updated LUCENE-3069: -- Attachment: LUCENE-3069.patch Patch, to show the impersonation hack for the Pulsing format. We cannot perfectly impersonate the old pulsing format yet: the old format divided the metadata block into inlined bytes and wrapped bytes, so when the term dict reader reads the length of the metadata block, it is actually the length of the 'inlined block'... and the 'wrapped block' won't be loaded for the wrapped PF. However, introducing a new method in PostingsReaderBase doesn't seem like a good way to fix this...
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Han Jiang updated LUCENE-3069: -- Attachment: LUCENE-3069.patch This patch shows how the current codecs (Block/BlockTree + Lucene4X/Pulsing/Mock*) change under our API refactoring. TestBackwardsCompatibility still fails; I'll work on the impersonation later.
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13748582#comment-13748582 ] Han Jiang commented on LUCENE-3069: ---

bq. Patch looks great on quick look! I'll look more when I'm back online...

OK! I'll commit it so that we can see later changes.

bq. One thing: I think e.g. BlockTreeTermsReader needs some back-compat code, so it won't try to read longsSize on old indices?

Yes, both Block* term dicts will have a new VERSION variable to mark the change, and if the codec header shows a previous version, they will not read that longsSize VInt.
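The version gate described above could look roughly like this (illustrative only: the class and constant names are hypothetical, and a plain int read on a ByteBuffer stands in for Lucene's CodecUtil header check and VInt decoding):

```java
import java.nio.ByteBuffer;

// Sketch of back-compat gating: only post-change indices carry the extra
// longsSize field, so the reader consults the header version before reading it.
class BackCompatSketch {
    static final int VERSION_META_ARRAY = 1; // hypothetical version that introduced longsSize

    static int readLongsSize(ByteBuffer in, int version) {
        if (version >= VERSION_META_ARRAY) {
            return in.getInt(); // new index: the field is present on disk
        }
        return 0; // old index: field absent, fall back to a safe default
    }
}
```

The key property is that the reader never advances its file pointer past data that an old index simply never wrote, which is what keeps pre-change indices readable.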
[jira] [Commented] (LUCENE-5179) Refactoring on PostingsWriterBase for delta-encoding
[ https://issues.apache.org/jira/browse/LUCENE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13742902#comment-13742902 ] Han Jiang commented on LUCENE-5179: --- Thanks! I'll commit.

Refactoring on PostingsWriterBase for delta-encoding Key: LUCENE-5179 URL: https://issues.apache.org/jira/browse/LUCENE-5179 Project: Lucene - Core Issue Type: Improvement Reporter: Han Jiang Assignee: Han Jiang Fix For: 5.0, 4.5 Attachments: LUCENE-5179.patch

A further step from LUCENE-5029. The short story is, the previous API change brings two problems:
* it somewhat breaks backward compatibility: although we can still read the old format, we can no longer reproduce it;
* the pulsing codec has problems with it.

And the long story... With the change, the current PostingsBase API works like this:
* the term dict tells the PBF we start a new term (via startTerm());
* the PBF adds docs, positions and other postings data;
* the term dict tells the PBF all the data for the current term is complete (via finishTerm()), and the PBF returns the metadata for the current term (as long[] and byte[]);
* the term dict might buffer all the metadata in an ArrayList; once all the terms are collected, it then decides how the metadata will be laid out on disk.

So after the API change, the PBF no longer has that annoying 'flushTermBlock', and instead the term dict maintains the term/metadata list. However, for each term we now write the long[] blob before the byte[], so the index format is not consistent with pre-4.5: e.g. in Lucene41, the metadata could be written as longA,bytesA,longB, but now we have to write it as longA,longB,bytesA.

Another problem is, the pulsing codec cannot tell the wrapped PBF how the metadata is delta-encoded; after all, PulsingPostingsWriter is only a PBF. For example, say we have terms=[a, a1, a2, b, b1, b2] and itemsInBlock=2, so theoretically we'll finally have three blocks in BTTR: [a b] [a1 a2] [b1 b2]. With this approach, the metadata of term b is delta-encoded based on the metadata of a, but when the term dict tells the PBF to finishTerm(b), it might blindly do the delta encoding based on term a2.

So I think maybe we can introduce a method 'encodeTerm(long[], DataOutput out, FieldInfo, TermState, boolean absolute)', so that during metadata flush we can control how the current term is written. The term dict will buffer TermStates, which implicitly hold the metadata like we do on the PBReader side. For example, if we want to reproduce the old Lucene41 format, we can simply set longsSize==0, then the PBF writes the old format (longA,bytesA,longB) to the DataOutput, and the compatibility issue is solved. For the pulsing codec, it will also be able to tell the lower level how to encode metadata.
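The control flow behind the proposed 'absolute' flag can be sketched like this (a simplification, not the real encodeTerm signature: a List<Long> stands in for the DataOutput, and a single file pointer stands in for the long[] metadata). The caller, i.e. the term dict, decides per term whether a block restarts and the raw value must be written, or whether a delta against the previous term suffices:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposed delta/absolute switch: when 'absolute' is true
// (first term of a block) the raw file pointer is written; otherwise only
// the delta against the previously encoded term.
class EncodeTermSketch {
    private long lastFP; // file pointer of the previously encoded term

    List<Long> encode(long[] filePointers, boolean[] absolute) {
        List<Long> out = new ArrayList<>();
        for (int i = 0; i < filePointers.length; i++) {
            out.add(absolute[i] ? filePointers[i] : filePointers[i] - lastFP);
            lastFP = filePointers[i]; // deltas always chain off the true value
        }
        return out;
    }
}
```

This is exactly what the pulsing wrapper needs: it can force absolute encoding at its own block boundaries instead of being handed a delta computed against the wrong previous term.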
[jira] [Updated] (LUCENE-5179) Refactoring on PostingsWriterBase for delta-encoding
[ https://issues.apache.org/jira/browse/LUCENE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-5179:
------------------------------
    Issue Type: Sub-task  (was: Improvement)
        Parent: LUCENE-3069
[jira] [Closed] (LUCENE-5179) Refactoring on PostingsWriterBase for delta-encoding
[ https://issues.apache.org/jira/browse/LUCENE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang closed LUCENE-5179.
-----------------------------
[jira] [Resolved] (LUCENE-5179) Refactoring on PostingsWriterBase for delta-encoding
[ https://issues.apache.org/jira/browse/LUCENE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang resolved LUCENE-5179.
-------------------------------
       Resolution: Fixed
    Lucene Fields: New,Patch Available  (was: New)
[jira] [Created] (LUCENE-5179) Refactoring on PostingsWriterBase for delta-encoding
Han Jiang created LUCENE-5179:
------------------------------
             Summary: Refactoring on PostingsWriterBase for delta-encoding
                 Key: LUCENE-5179
                 URL: https://issues.apache.org/jira/browse/LUCENE-5179
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Han Jiang
            Assignee: Han Jiang
             Fix For: 5.0, 4.5
[jira] [Updated] (LUCENE-5179) Refactoring on PostingsWriterBase for delta-encoding
[ https://issues.apache.org/jira/browse/LUCENE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-5179:
------------------------------
    Attachment: LUCENE-5179.patch

Patch for branch3069, tests pass for all 'temp' postings formats.
[jira] [Commented] (LUCENE-5179) Refactoring on PostingsWriterBase for delta-encoding
[ https://issues.apache.org/jira/browse/LUCENE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742787#comment-13742787 ]

Han Jiang commented on LUCENE-5179:
-----------------------------------

bq. Is it for real back compat or for impersonation?
bq. Real back-compat (reader can read the old index format using the new APIs) should work fine, I think?

Yes, this should be 'impersonation', but actually the back-compat I mentioned is a weak requirement. I'm not happy with this revert either, so let's see if we can do something to hack it! :)

The strong requirement is: if we need pulsing to work with the new API, there should be something to tell pulsing how to encode each term. Ideally pulsing should tell the term dict longsSize=0, while maintaining the wrapped PF's longsSize. The calling chain is:

{noformat}
termdict ~~finishTermA(long[0], byte[]...)~~ pulsing ~~finishTermB(long[3], byte[]...)~~ wrappedPF
{noformat}

Take the terms=[a, a1, ...] example: when term b is finished, the wrapped PF will fill the long[] and byte[] with its metadata, and pulsing will instead fill the byte[] as its 'fake' metadata. When a term is not inlined, pulsing will have to encode the wrapped PF's long[] into its byte[], but it's too early! Term b should be delta-encoded against term a, and pulsing will never know this...

If we only need pulsing to work, there is a trade-off: pulsing returns the wrapped PF's longsSize, and the term dict does the buffering. But for Lucene41Pulsing with positions+payloads, we'll then have to write 3 zero VLongs, along with the pulsing byte[], for an inlined term... and it's not actually 'pulsing' then.
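The pulsing trade-off described in the comment above can be made concrete with a tiny sketch (illustrative names only, not Lucene's actual PulsingPostingsWriter): if the pulsing layer reports the wrapped format's longsSize instead of 0, every inlined term must still carry that many useless zero longs.

```java
// Illustrative sketch only (hypothetical names, not Lucene code): the
// pulsing layer reports the wrapped postings format's longsSize, so an
// inlined term still ships longsSize zero longs.
class ToyPulsingWriter {
  private final int wrappedLongsSize; // e.g. 3 for doc/pos/payload pointers

  ToyPulsingWriter(int wrappedLongsSize) {
    this.wrappedLongsSize = wrappedLongsSize;
  }

  /** Reported to the term dict; reporting 0 would break the wrapped deltas. */
  int longsSize() {
    return wrappedLongsSize;
  }

  /** Metadata longs for a term whose postings were inlined into byte[]. */
  long[] inlinedTermLongs() {
    return new long[wrappedLongsSize]; // all zeros: "not actually pulsing"
  }
}
```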
[jira] [Commented] (LUCENE-5179) Refactoring on PostingsWriterBase for delta-encoding
[ https://issues.apache.org/jira/browse/LUCENE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742792#comment-13742792 ]

Han Jiang commented on LUCENE-5179:
-----------------------------------

By the way, Mike, I think this change doesn't preclude the Simple9/16 encoding you mentioned. You can have a look at the changed TempFSTTermsWriter: here we always pass 'true' to encodeTerm, so the PBF will not do any delta encoding; instead the FST takes that responsibility. When we need to block-encode the long[] for a whole term block, the term dict can simply buffer all the long[] returned by encodeTerm(..., true), then apply the compression algorithm. Whether to do VLong encoding is decided by the term dict, not the PBF. 'encodeTerm' only performs the delta encoding, and provides the PBF the chance to know how the current term is flushed along with other terms.
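The division of labor in the comment above can be sketched as follows (hypothetical `BlockMetadataBuffer`, not real Lucene code): the PBF hands back absolute longs via `encodeTerm(..., true)`, and the term dict alone chooses the block compression; here, simple per-column deltas stand in for a Simple9/16-style packer.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch (illustrative only): the term dictionary buffers the
// absolute metadata longs of one block and decides the compression
// itself; the postings writer never sees block boundaries.
class BlockMetadataBuffer {
  private final List<long[]> block = new ArrayList<>();

  void add(long[] absoluteLongs) {
    block.add(absoluteLongs.clone());
  }

  /** Term dict's own choice of encoding; here, per-column deltas. */
  long[] flushColumnDeltas(int column) {
    long[] out = new long[block.size()];
    long prev = 0;
    for (int i = 0; i < block.size(); i++) {
      out[i] = block.get(i)[column] - prev;
      prev = block.get(i)[column];
    }
    block.clear();
    return out;
  }
}
```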
[jira] [Comment Edited] (LUCENE-5179) Refactoring on PostingsWriterBase for delta-encoding
[ https://issues.apache.org/jira/browse/LUCENE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742792#comment-13742792 ]

Han Jiang edited comment on LUCENE-5179 at 8/17/13 2:05 AM:
------------------------------------------------------------

By the way, Mike, I think this change doesn't preclude the Simple9/16 encoding you mentioned. You can have a look at the changed TempFSTTermsWriter: here we always pass 'true' to encodeTerm, so the PBF will not do any delta encoding; instead the FST takes that responsibility. When we need to block-encode the long[] for a whole term block, the term dict can simply buffer all the long[] returned by encodeTerm(..., true), then apply the compression algorithm. Whether to do VLong/delta encoding is still decided by the term dict, not the PBF. 'encodeTerm' only performs the delta encoding, and provides the PBF the chance to know how the current term is flushed along with other terms.
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3069:
------------------------------
    Attachment: LUCENE-3069.patch

Patch: update the BlockTerms dict so that it follows the refactored API.

Lucene should have an entirely memory resident term dictionary
--------------------------------------------------------------
                 Key: LUCENE-3069
                 URL: https://issues.apache.org/jira/browse/LUCENE-3069
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/index, core/search
    Affects Versions: 4.0-ALPHA
            Reporter: Simon Willnauer
            Assignee: Han Jiang
              Labels: gsoc2013
             Fix For: 5.0, 4.5
         Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch

The FST-based TermDictionary has been a great improvement, yet it still uses a delta codec file for scanning to terms. Some environments have enough memory available to keep the entire FST-based term dict in memory. We should add a TermDictionary implementation that encodes all needed information for each term into the FST (custom fst.Output) and builds the FST from the entire term, not just the delta.
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738105#comment-13738105 ]

Han Jiang commented on LUCENE-3069:
-----------------------------------

Hi, currently we have a problem when migrating the code to trunk. The API refactoring on PostingsReader/WriterBase now splits term metadata into two parts: a monotonic long[] and a generic byte[]; the former is known to the term dictionary for better d-gap encoding. So we need a 'longsSize' in the field summary to tell the reader the fixed length of this monotonic long[].

However, this API change actually breaks backward compatibility: the old 4.x indices didn't support this, and for some codecs like Lucene40, since their writer parts are already deprecated, their tests won't pass. It seems like we could put all the metadata in the generic byte[] and let the PBF do its own buffering (like we do in the old API: nextTerm()), but then we'd have to add that logic to every PBF. So... can we solve this problem more elegantly?
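The 'longsSize' point in the comment above can be illustrated with a toy round trip (hypothetical layout and class name, not the real field-summary format): without recording longsSize per field, the reader cannot tell how many monotonic longs precede each term's generic byte[] metadata.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Toy round trip (hypothetical layout, not Lucene's field summary):
// longsSize is written once per field so the reader can recover the
// fixed number of metadata longs per term.
class FieldSummaryDemo {
  static byte[] write(int longsSize, long[] termLongs) {
    try {
      ByteArrayOutputStream bos = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bos);
      out.writeInt(longsSize); // recorded once in the field summary
      for (long l : termLongs) out.writeLong(l);
      out.flush();
      return bos.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  static long[] read(byte[] data) {
    try {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
      int longsSize = in.readInt(); // reader recovers the fixed length
      long[] longs = new long[longsSize];
      for (int i = 0; i < longsSize; i++) longs[i] = in.readLong();
      return longs;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```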
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3069:
------------------------------
    Attachment: LUCENE-3069.patch

Patch with backward compatibility fix on Lucene41PBF (TempPostingsReader is actually a fork of Lucene41PostingsReader).
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Han Jiang updated LUCENE-3069: -- Attachment: LUCENE-3069.patch Uploaded patch. It is optimized for wildcard queries, and I did a quick test on 1M wiki data:
{noformat}
Task        QPS base  StdDev  QPS comp  StdDev  Pct diff
PKLookup      314.63  (1.5%)    314.64  (1.2%)      0.0% (  -2% -    2%)
Fuzzy1         91.32  (3.7%)     92.50  (1.6%)      1.3% (  -3% -    6%)
Respell       104.54  (3.9%)    106.97  (1.6%)      2.3% (  -2% -    8%)
Fuzzy2         38.22  (4.1%)     39.16  (1.2%)      2.5% (  -2% -    8%)
Wildcard      109.56  (3.1%)    273.42  (5.0%)    149.6% ( 137% -  162%)
{noformat}
and TempFSTOrd vs. Lucene41, on 1M data:
{noformat}
Task        QPS base  StdDev  QPS comp  StdDev  Pct diff
Respell       134.85  (3.7%)    106.30  (0.6%)    -21.2% ( -24% -  -17%)
Fuzzy2         47.78  (4.1%)     39.03  (0.9%)    -18.3% ( -22% -  -13%)
Fuzzy1        112.02  (3.0%)     91.95  (0.6%)    -17.9% ( -20% -  -14%)
Wildcard      326.68  (3.5%)    273.41  (1.9%)    -16.3% ( -20% -  -11%)
PKLookup      194.61  (1.8%)    314.24  (0.7%)     61.5% (  57% -   65%)
{noformat}
But I'm not happy with it :(, the hack I did here is to consume another big block to store the last byte of each term. So for a wildcard query ab*c, we have external information to tell us the ord of the nearest term matching *c. Knowing the ord, we can use an approach similar to getByOutput to jump to the next target term. Previously, we had to walk the FST to the stop node to find out whether the last byte is 'c', so this optimization is a big win. However, I don't really like this patch :(, we have to increase the index size (521M -> 530M), and the code becomes messy, since we always have to foresee the next arc on the current stack. 
Re: VInt block length in Lucene 4.1 postings format
Hi Aleksandra, The PostingsReader uses a skip list to determine the start file pointer of each block (both FOR-packed and vInt-encoded). The information is currently maintained by Lucene41SkipReader. The tricky part is, for each term, the skip data is exactly at the end of the TermFreqs blocks, so if you fetch the startFP for the vInt block, and know the docTermStartOffset skipOffset for the current term, you can work out what you need. http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html#Frequencies On Thu, Aug 1, 2013 at 4:20 PM, Aleksandra Woźniak aleksandra.k.wozn...@gmail.com wrote: Hi all, recently I wanted to try out some modifications of Lucene's postings format (namely, copying blocks that have no deletions without int decoding/encoding -- this is similar to what was described here: https://issues.apache.org/jira/browse/LUCENE-2082). I started with changing the Lucene 4.1 postings format to check what can be done there. I came across the following problem: in Lucene41PostingsReader the length (number of bytes) of the last, vInt-encoded, block of postings is not known before all individual postings are read and decoded. When reading this block we only know the number of postings that should be read and decoded -- since vInts have different sizes by definition. If I wanted to copy the whole block without vInt decoding/encoding, I would need to know how many bytes to read from the postings index input. So, my question is: is there a clean way to determine the length of this block (i.e. the number of bytes that this block has)? Is the number of bytes in a posting list tracked somewhere in the Lucene 4.1 postings format? Thanks, Aleksandra -- Han Jiang Team of Search Engine and Web Mining, School of Electronic Engineering and Computer Science, Peking University, China
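A minimal sketch of why the vInt block's byte length is not known up front (this is a simplified re-implementation of the vInt scheme for illustration, not Lucene's own classes): each value takes 1–5 bytes depending on its magnitude, so the block's byte length only falls out after encoding every value, which is why the reply above suggests deriving it from skip-list file pointers instead.

```java
import java.io.ByteArrayOutputStream;

public class VIntSketch {
    // Lucene-style vInt: 7 payload bits per byte, high bit set means
    // "more bytes follow". Values must be non-negative here.
    public static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int[] docDeltas = {1, 5, 127, 128, 16384};
        int start = out.size();
        for (int d : docDeltas) writeVInt(out, d);
        // 1 + 1 + 1 + 2 + 3 bytes: the total is only known after encoding,
        // so a reader must get it externally, e.g. nextBlockFP - thisBlockFP.
        System.out.println(out.size() - start); // 8
    }
}
```

This is the crux of the question above: with skip data giving you the file pointer of the block start and of the next block, the byte length is just the difference, with no vInt decoding needed.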
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13724955#comment-13724955 ] Han Jiang commented on LUCENE-3069: --- Performance results after the last patch (intersect) is applied. On 33M wiki data, between TempFST (with intersect) and TempFSTOrd (with intersect):
{noformat}
Task        QPS base  StdDev  QPS comp  StdDev  Pct diff
PKLookup      232.47  (1.0%)    205.28  (2.0%)    -11.7% ( -14% -   -8%)
Prefix3        26.93  (1.2%)     28.40  (1.4%)      5.5% (   2% -    8%)
Wildcard        6.75  (2.1%)      7.37  (1.5%)      9.2% (   5% -   13%)
Fuzzy1         29.86  (1.8%)     51.87  (3.7%)     73.7% (  67% -   80%)
Fuzzy2         30.82  (1.6%)     53.82  (2.7%)     74.7% (  69% -   80%)
Respell        27.30  (1.2%)     49.55  (2.6%)     81.5% (  76% -   86%)
{noformat}
So the decoding of outputs really hurts the most. And now we should start to compare it with trunk (base=Lucene41, comp=TempFSTOrd). Hmm, I must have done something wrong on wildcard queries here:
{noformat}
Task        QPS base  StdDev  QPS comp  StdDev  Pct diff
Wildcard       19.21  (2.1%)      7.30  (0.3%)    -62.0% ( -63% -  -60%)
Prefix3        33.69  (1.2%)     28.18  (0.9%)    -16.4% ( -18% -  -14%)
Fuzzy1         61.59  (2.1%)     52.36  (0.8%)    -15.0% ( -17% -  -12%)
Fuzzy2         60.94  (1.0%)     54.15  (1.3%)    -11.1% ( -13% -   -8%)
Respell        54.21  (2.8%)     49.54  (1.2%)     -8.6% ( -12% -   -4%)
PKLookup      148.40  (1.0%)    208.07  (3.6%)     40.2% (  35% -   45%)
{noformat}
I'll commit the current version so we can iterate on it.
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13725288#comment-13725288 ] Han Jiang commented on LUCENE-3069: --- bq. Maybe try testing on a different wildcard query, e.g. something like a*b* (that does not have a commonSuffix)? I replaced all the ab*c in the tasks file with ab*c*, but the performance hit is still heavy: 33M wiki data, Lucene41 vs. TempFSTOrd
{noformat}
Wildcard        7.40  (1.9%)      4.63  (1.2%)    -37.5% ( -39% -  -34%)
{noformat}
Re: Welcome Cassandra Targett as Lucene/Solr committer
Welcome Cassandra! On Thu, Aug 1, 2013 at 6:47 AM, Robert Muir rcm...@gmail.com wrote: I'm pleased to announce that Cassandra Targett has accepted to join our ranks as a committer. Cassandra worked on the donation of the new Solr Reference Guide [1] and getting things in order for its first official release [2]. Cassandra, it is tradition that you introduce yourself with a brief bio. Welcome! P.S. As soon as your SVN access is setup, you should then be able to add yourself to the committers list on the website as well. [1] https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide [2] https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/ -- Han Jiang Team of Search Engine and Web Mining, School of Electronic Engineering and Computer Science, Peking University, China - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5152) Lucene FST is not immutable
[ https://issues.apache.org/jira/browse/LUCENE-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13723922#comment-13723922 ] Han Jiang commented on LUCENE-5152: --- bq. So its really just a BytesRef bug right? +1, so tricky Lucene FST is not immutable -- Key: LUCENE-5152 URL: https://issues.apache.org/jira/browse/LUCENE-5152 Project: Lucene - Core Issue Type: Bug Components: core/FSTs Affects Versions: 4.4 Reporter: Simon Willnauer Priority: Blocker Fix For: 5.0, 4.5 Attachments: LUCENE-5152.patch A spin-off from LUCENE-5120, where the analyzing suggester modified a returned output from an FST (BytesRef), which caused side effects in later execution. I added an assertion into the FST that checks if a cached root arc is modified, and in fact this happens, for instance, in our MemoryPostingsFormat, and I bet we'll find more places. We need to think about how to make this less trappy since it can cause bugs that are super hard to find.
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Han Jiang updated LUCENE-3069: -- Attachment: LUCENE-5152.patch The previous design put much stress on the decoding of Outputs. This becomes a disaster for wildcard queries: e.g. for f*nd, we usually have to walk to the last character in the FST, only to find that it is not 'd' and the automaton doesn't accept the term. In this case, TempFST is actually iterating over all the results of f*, decoding all the metadata for them... So I'm trying another approach; the main idea is to load metadata stats as lazily as possible. Here I use an FST with Long outputs as the term index, and leave all other stuff in a single term block. The term index FST holds the relationship between Term and Ord, and in the term block we can maintain a skip list to find the related metadata stats. It is a little similar to BTTR now, and we can someday control how much data to keep memory resident (e.g. keep stats in memory but metadata on disk; however, this should be another issue). Another good part is, it naturally supports seek by ord (ah, actually I don't understand where that is used). Tests pass, and intersect is not implemented yet. Perf based on 1M wiki data, between non-intersect TempFST and TempFSTOrd:
{noformat}
Task        QPS base  StdDev  QPS comp  StdDev  Pct diff
PKLookup      373.80  (0.0%)    320.30  (0.0%)    -14.3% ( -14% -  -14%)
Fuzzy1         43.82  (0.0%)     47.10  (0.0%)      7.5% (   7% -    7%)
Prefix3       399.62  (0.0%)    433.95  (0.0%)      8.6% (   8% -    8%)
Fuzzy2         14.26  (0.0%)     15.95  (0.0%)     11.9% (  11% -   11%)
Respell        40.69  (0.0%)     46.29  (0.0%)     13.8% (  13% -   13%)
Wildcard       83.44  (0.0%)     96.54  (0.0%)     15.7% (  15% -   15%)
{noformat}
The perf hit on PKLookup should be sane, since I haven't optimized the skip list yet. I'll update intersect() later, and after that we'll cut over to PagedBytes/PackedLongBuffer. 
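The "term index gives you an ord, a skip list finds the metadata block" design described in the comment above can be sketched as follows. The block size and the skip-list layout here are invented for illustration; the real patch's on-disk format is different, but the lookup arithmetic is the same idea.

```java
public class OrdSkipSketch {
    // Terms per metadata block (illustrative; a real format would tune this).
    static final int BLOCK = 4;

    // skipFPs[i] = file pointer of metadata block i. Given a term's ord
    // (from the term-index FST), return the file pointer of the block
    // holding that term's metadata; the reader then scans within the block.
    public static long blockFP(long[] skipFPs, long ord) {
        return skipFPs[(int) (ord / BLOCK)];
    }

    public static void main(String[] args) {
        long[] skipFPs = {0, 37, 81, 140};
        // ord 6 lives in block 6/4 = 1, which starts at file pointer 37
        System.out.println(blockFP(skipFPs, 6)); // 37
    }
}
```

This is what makes the lazy loading possible: resolving a term to its ord touches only the in-memory FST, and the per-term metadata is decoded only after the skip list has narrowed the scan to one block.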
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Han Jiang updated LUCENE-3069: -- Attachment: LUCENE-3069.patch
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Han Jiang updated LUCENE-3069: -- Attachment: (was: LUCENE-5152.patch)
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Han Jiang updated LUCENE-3069: -- Attachment: LUCENE-3069.patch Patch, revive IntersectTermsEnum in TempFSTOrd. Mike, since we already have an intersect() impl, maybe we can still keep this? By the way, it is easy to migrate from TempFST to TempFSTOrd.
[jira] [Created] (LUCENE-5138) Update source file attributes
Han Jiang created LUCENE-5138: - Summary: Update source file attributes Key: LUCENE-5138 URL: https://issues.apache.org/jira/browse/LUCENE-5138 Project: Lucene - Core Issue Type: Improvement Reporter: Han Jiang Priority: Minor Fix For: 5.0, 4.5 Currently we have many java files with the executable attribute set, while some scripts that generate source files are missing it. Maybe we should clean this up?
[jira] [Updated] (LUCENE-5138) Update source file attributes
[ https://issues.apache.org/jira/browse/LUCENE-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Han Jiang updated LUCENE-5138: -- Attachment: LUCENE-5138.patch Patch, created by:
{noformat}
find -executable -type f -name "*.java" -exec svn propdel svn:executable {} \;
{noformat}
Since our builder is going to regenerate source files soon, maybe it is OK to keep the executable bit missing for those scripts?
[jira] [Closed] (LUCENE-5138) Update source file attributes
[ https://issues.apache.org/jira/browse/LUCENE-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Han Jiang closed LUCENE-5138. - Resolution: Fixed
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Han Jiang updated LUCENE-3069: -- Attachment: LUCENE-3069.patch Uploaded patch: implemented IntersectEnum.next() and seekCeil(); lots of nocommits, but all tests pass. The main idea is to run a DFS on the FST, and backtrack as early as possible (i.e. when we see that a label is rejected by the automaton). For this version, there is one explicit perf overhead: I use a real stack here, which could be replaced by a Frame[] to reuse objects. There are several aspects I didn't dig deep into: * currently, CompiledAutomaton provides a commonSuffixRef, but how can we make use of it in an FST? * the DFS is somewhat a 'goto' version, i.e. we could make the code cleaner with a single while-loop similar to a BFS search. However, since the FST doesn't always tell us how many arcs leave the current node, we have a problem dealing with this... * when the FST is large enough, the next() operation will take much time doing the linear arc read; maybe we should make use of CompiledAutomaton.sortedTransition[] when a node has many outgoing arcs.
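The early-backtracking DFS described in the patch comment above can be sketched in miniature. This is a hypothetical stand-in, not the patch: a TreeMap-based trie replaces Lucene's FST, and a simple per-position predicate replaces CompiledAutomaton, but the core move is the same: prune a whole subtree the moment the automaton rejects a label, instead of walking to the end of every candidate term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class IntersectSketch {
    public static class Node {
        boolean isTerm;
        TreeMap<Character, Node> arcs = new TreeMap<>(); // sorted, like FST arcs
        void add(String s) {
            Node n = this;
            for (char c : s.toCharArray()) n = n.arcs.computeIfAbsent(c, k -> new Node());
            n.isTerm = true;
        }
    }

    // "Does the automaton accept label c at position depth?" -- a toy
    // stand-in for stepping a real CompiledAutomaton.
    public interface Step { boolean accepts(int depth, char label); }

    static void dfs(Node n, StringBuilder path, Step auto, List<String> out) {
        if (n.isTerm) out.add(path.toString());
        for (Map.Entry<Character, Node> e : n.arcs.entrySet()) {
            if (!auto.accepts(path.length(), e.getKey())) continue; // backtrack early
            path.append(e.getKey());
            dfs(e.getValue(), path, auto, out);
            path.setLength(path.length() - 1); // pop, like reusing a Frame
        }
    }

    public static List<String> intersect(Node root, Step auto) {
        List<String> out = new ArrayList<>();
        dfs(root, new StringBuilder(), auto, out);
        return out;
    }

    public static List<String> demo() {
        Node root = new Node();
        for (String s : new String[]{"find", "fond", "fund", "far"}) root.add(s);
        // A "f?nd"-like pattern: 'f' at 0, anything at 1, 'n' at 2, 'd' at 3.
        Step fXnd = (d, c) ->
            (d == 0 && c == 'f') || d == 1 || (d == 2 && c == 'n') || (d == 3 && c == 'd');
        return intersect(root, fXnd);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [find, fond, fund]
    }
}
```

Note how "far" is abandoned at depth 2 without ever reaching a leaf; on a real FST this is exactly the saving over decoding outputs for every term matched by the f* prefix.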
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Han Jiang updated LUCENE-3069: -- Attachment: LUCENE-3069.patch
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13717911#comment-13717911 ] Han Jiang commented on LUCENE-3069: --- bq. You should not need to .getPosition / .setPosition on the fstReader: Oh, yes! I'll fix. bq. I think we can't really make use of it, which is fine (it's an optional optimization). OK, actually I was quite curious why we don't make use of commonPrefixRef in CompiledAutomaton. Maybe we can determinize the input Automaton first, then get commonPrefixRef via SpecialOperation? Is it too slow, or is the prefix not always long enough to be worth it? bq. But this can only be done if that FST node's arcs are array'd right? Yes, array arcs only, and we might need methods like advance(label) to do the search; here gossip search might work better than traditional binary search. {quote} Separately, supporting ord w/ FST terms dict should in theory be not so hard; you'd need to use getByOutput to seek by ord. Maybe (later, eventually) we can make this a write-time option. We should open a separate issue ... {quote} Ah, yes, but it seems that getByOutput doesn't rewind/reuse previous state? We always have to start from the first arc during every seek. However, I'm not sure in what kinds of use case we need the ord information. I'll commit the current version first, so we can iterate.
Re: for those of you using gmail...
On Wed, Jul 17, 2013 at 10:26 PM, Michael McCandless luc...@mikemccandless.com wrote: Can you try this search in your gmail: from:jenk...@thetaphi.de regression build 6605 And let me know if you get 1 or 0 results back? Yes, 0 results here. 1 result when I remove 'regression'. And it seems that it returns no results for the query: from:jenk...@thetaphi.de subject:build 6605 ANY_WORD_NOT_IN_TITLE Maybe for some mails, only the title field is taken into consideration? I get 0 results back but I should get 1, I think. Furthermore, if I search for: from:jenk...@thetaphi.de regression I only get results up to Jul 2, even though there are many build failures after that. A recent search got results up to #6530. Still no 6605. -- Han Jiang Team of Search Engine and Web Mining, School of Electronic Engineering and Computer Science, Peking University, China
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Han Jiang updated LUCENE-3069: -- Attachment: LUCENE-3069.patch Patch: revert hashCode()
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13708486#comment-13708486 ] Han Jiang commented on LUCENE-3069: --- bq. I think we should assert that the seekCeil returned SeekStatus.FOUND? Ok! I'll commit that. bq. useCache is an ancient option from back when we had a terms dict cache Yes, I suppose is is not 'clear' to have this parameter. bq. seekExact is working as it should I think. Currently, I think those 'seek' methods are supposed to change the enum pointer based on input term string, and fetch related metadata from term dict. However, seekExact(BytesRef, TermsState) simply 'copy' the value of termState to enum, which doesn't actually operate 'seek' on dictionary. bq. Maybe instead of term and meta members, we could just hold the current pair? Oh, yes, I once thought about this, but not sure: like, can the callee always makes sure that, when 'term()' is called, it will always return a valid term? The codes in MemoryPF just return 'pair.output' regardless whether pair==null, is it safe? bq. TempMetaData.hashCode() doesn't mix in docFreq/tTF? Oops! thanks, nice catch! Lucene should have an entirely memory resident term dictionary -- Key: LUCENE-3069 URL: https://issues.apache.org/jira/browse/LUCENE-3069 Project: Lucene - Core Issue Type: Improvement Components: core/index, core/search Affects Versions: 4.0-ALPHA Reporter: Simon Willnauer Assignee: Han Jiang Labels: gsoc2013 Fix For: 4.4 Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch FST based TermDictionary has been a great improvement yet it still uses a delta codec file for scanning to terms. Some environments have enough memory available to keep the entire FST based term dict in memory. We should add a TermDictionary implementation that encodes all needed information for each term into the FST (custom fst.Output) and builds a FST from the entire term not just the delta. 
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
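The 'pair==null' concern quoted above can be made concrete with a small, hypothetical sketch; none of these names are Lucene's actual classes, and the real MemoryPF enum is more involved:

```java
// Hypothetical sketch of the null-safety question: an enum that holds a
// (term, output) pair which stays null until the enum has been positioned.
final class PairEnumSketch {
    static final class Pair {
        final Object output = new Object();  // stand-in for the FST output
    }

    Pair pair;  // null before the first successful seek/next

    // as in the quoted code: throws NullPointerException if never positioned
    Object termUnchecked() { return pair.output; }

    // defensive variant: fail with a clear message instead
    Object termChecked() {
        if (pair == null) {
            throw new IllegalStateException("enum is not positioned on a term");
        }
        return pair.output;
    }
}
```

Whether the unchecked variant is safe depends entirely on whether every caller guarantees the enum is positioned before calling term(), which is exactly the question raised above.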
[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13708486#comment-13708486 ] Han Jiang edited comment on LUCENE-3069 at 7/15/13 2:20 PM:

bq. I think we should assert that the seekCeil returned SeekStatus.FOUND?

Ok! I'll commit that.

bq. useCache is an ancient option from back when we had a terms dict cache

Yes, I suppose it is not 'clear' to have this parameter.

bq. seekExact is working as it should I think.

Currently, I think those 'seek' methods are supposed to move the enum pointer based on the input term, and fetch the related metadata from the term dict. However, seekExact(BytesRef, TermState) simply copies the value of termState into the enum, which doesn't actually perform a 'seek' on the dictionary.

bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh, yes, I once thought about this, but I'm not sure: can the callee always make sure that, when 'term()' is called, it will return a valid term? The code in MemoryPF just returns 'pair.output' regardless of whether pair==null; is that safe?

bq. TempMetaData.hashCode() doesn't mix in docFreq/tTF?

Oops! Thanks, nice catch!

bq. It doesn't impl equals (must it really impl hashCode?)

Hmm, do we need equals? Also, NodeHash relies on hashCode to judge whether two FST nodes can be 'merged'.
[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13708486#comment-13708486 ] Han Jiang edited comment on LUCENE-3069 at 7/15/13 2:35 PM:

bq. I think we should assert that the seekCeil returned SeekStatus.FOUND?

Ok! I'll commit that.

bq. useCache is an ancient option from back when we had a terms dict cache

Yes, I suppose it is not 'clear' to have this parameter.

bq. seekExact is working as it should I think.

Currently, I think those 'seek' methods are supposed to move the enum pointer based on the input term, and fetch the related metadata from the term dict. However, seekExact(BytesRef, TermState) simply copies the value of termState into the enum, which doesn't actually perform a 'seek' on the dictionary.

bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh, yes, I once thought about this, but I'm not sure: can the callee always make sure that, when 'term()' is called, it will return a valid term? The code in MemoryPF just returns 'pair.output' regardless of whether pair==null; is that safe?

bq. TempMetaData.hashCode() doesn't mix in docFreq/tTF?

Oops! Thanks, nice catch!

bq. It doesn't impl equals (must it really impl hashCode?)

-Hmm, do we need equals? Also, NodeHash relies on hashCode to judge whether two FST nodes can be 'merged'.- Oops, I forgot it still relies on equals to make sure two instances really match; ok, I'll add that.
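The point that NodeHash needs equals as well as hashCode can be illustrated with a plain HashMap; the toy OnlyHash class below is hypothetical and much simpler than Lucene's node freezing:

```java
import java.util.HashMap;

// Hedged illustration: a hash table finds candidate entries via hashCode,
// but only equals confirms a real match, so overriding hashCode alone is
// not enough to let two equivalent nodes be 'merged'.
final class HashVsEquals {
    static final class OnlyHash {
        @Override public int hashCode() { return 42; }  // every instance collides
        // equals() not overridden: falls back to identity comparison, so
        // no two distinct instances ever compare equal
    }

    static int distinctKeys() {
        HashMap<OnlyHash, Boolean> map = new HashMap<>();
        map.put(new OnlyHash(), true);
        map.put(new OnlyHash(), true);
        return map.size();  // identical hash codes, but two separate entries
    }
}
```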
[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13708486#comment-13708486 ] Han Jiang edited comment on LUCENE-3069 at 7/15/13 4:09 PM:

bq. I think we should assert that the seekCeil returned SeekStatus.FOUND?

Ok! I'll commit that.

bq. useCache is an ancient option from back when we had a terms dict cache

Yes, I suppose it is not 'clear' to have this parameter.

bq. seekExact is working as it should I think.

Currently, I think those 'seek' methods are supposed to move the enum pointer based on the input term, and fetch the related metadata from the term dict. However, seekExact(BytesRef, TermState) simply copies the value of termState into the enum, which doesn't actually perform a 'seek' on the dictionary.

bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh, yes, I once thought about this, but I'm not sure: can the callee always make sure that, when 'term()' is called, it will return a valid term? The code in MemoryPF just returns 'pair.output' regardless of whether pair==null; is that safe?

bq. TempMetaData.hashCode() doesn't mix in docFreq/tTF?

Oops! Thanks, nice catch!

bq. It doesn't impl equals (must it really impl hashCode?)

-Hmm, do we need equals? Also, NodeHash relies on hashCode to judge whether two FST nodes can be 'merged'.- Oops, I forgot it still relies on equals to make sure two instances really match; ok, I'll add that.

By the way, for real data, when two outputs are not 'NO_OUTPUT', even if they contain the same metadata + stats, it is very seldom that their arcs can be identical in the FST (the index grows by less than 1MB for wikimedium1m if equals always returns false for non-singleton arguments). Therefore... yes, hashCode() isn't necessary here.
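The hashCode/equals fix under discussion can be sketched as follows; TermMetaSketch is a hypothetical stand-in for TempMetaData, and the fp field is an assumed file-pointer-style member, not the real class layout:

```java
// Hedged sketch: a metadata "output" whose hashCode and equals both mix in
// docFreq and totalTermFreq (the fields the original hashCode forgot),
// keeping the two methods consistent with each other.
final class TermMetaSketch {
    final long docFreq;
    final long totalTermFreq;
    final long fp;  // assumed: a pointer into the postings file

    TermMetaSketch(long docFreq, long totalTermFreq, long fp) {
        this.docFreq = docFreq;
        this.totalTermFreq = totalTermFreq;
        this.fp = fp;
    }

    @Override public int hashCode() {
        int h = Long.hashCode(fp);
        h = 31 * h + Long.hashCode(docFreq);        // previously forgotten
        h = 31 * h + Long.hashCode(totalTermFreq);  // previously forgotten
        return h;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof TermMetaSketch)) return false;
        TermMetaSketch t = (TermMetaSketch) o;
        return fp == t.fp && docFreq == t.docFreq
            && totalTermFreq == t.totalTermFreq;
    }
}
```

With both methods consistent, a NodeHash-style table can first filter candidates by hashCode and then confirm a genuine match with equals.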
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Han Jiang updated LUCENE-3069:
--
Attachment: LUCENE-3069.patch

Patch according to previous comments. We still somewhat need hashCode() to exist, because NodeHash checks whether the frozen node has the same hashcode as the uncompiled node (NodeHash:128).
[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13708638#comment-13708638 ] Han Jiang edited comment on LUCENE-3069 at 7/15/13 5:08 PM:

Patch according to previous comments. We still somewhat need hashCode() to exist, because NodeHash checks whether the frozen node has the same hashcode as the uncompiled node (NodeHash.java:128). Although later, for nodes with outputs, it will hardly ever find a matching node in the hashtable.
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707709#comment-13707709 ] Han Jiang commented on LUCENE-3069:
---

I did a checkindex on wikimediumall.trunk.Lucene41.nd33.3326M. Here is the bit width summary for the body field:

||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656| 48860170| 43532656|
| 2| 10328824| 13979539| 16200377|
| 3| 2682453| 5032450| 6532755|
| 4| 836109| 2471794| 3134437|
| 5| 262696| 1324704| 1718862|
| 6| 86487| 755797| 990563|
| 7| 29276| 442974| 571996|
| 8| 11257| 263874| 339382|
| 9| 4627| 161402| 205662|
|10| 2060| 102198| 128034|
|11| 979| 63955| 79531|
|12| 386| 39377| 48805|
|13| 170| 24321| 30113|
|14| 65| 14686| 18437|
|15| 10| 9055| 10918|
|16| 2| 5229| 6821|
|17| 0| 2669| 3595|
|18| 0| 1312| 1897|
|19| 0| 696| 914|
|20| 0| 209| 509|
|21| 0| 44| 148|
|22| 0| 4| 38|
|23| 0| 0| 8|
|24| 0| 0| 1|
|25| 0| 0| 0|
|26| 0| 0| 0|
|27| 0| 0| 0|
|28| 0| 0| 0|
|29| 0| 0| 0|
|30| 0| 0| 0|
|31| 0| 0| 0|
|32| 0| 0| 0|

So we have 66.4% of docFreqs with df==1, and 78.5% with df==ttf. Considering the different bit sizes, the df+ttf encoding saves 57.3MB out of 148.7MB in total, using the following estimation:

{noformat}
old_size = col[2] * vIntByteSize(rownumber) + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * vIntByteSize(rownumber)
{noformat}

By the way, I am quite tempted to omit the frq blocks in Lucene41PostingsReader. When we know that df==ttf, we can always be sure that the in-doc freq==1. So for example, when the bit width ranges from 2 to 8 (inclusive), since df is not large enough to create ForBlocks, we have to vInt-encode each in-doc freq. For this 'body' field, I think the index size we can reduce is about 67.5MB (here I only consider the vInt block, since a 1-bit ForBlock is usually small). For all the fields in wikimediumall, we can save 60.8MB out of 245.2MB (for df+ttf only). While the vInt frq block we can omit from PBF is about 95.8MB, I suppose. I'll test this later.
[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707709#comment-13707709 ] Han Jiang edited comment on LUCENE-3069 at 7/13/13 11:00 AM:
-
I did a checkindex on wikimediumall.trunk.Lucene41.nd33.3326M. Here is the bit width summary for the body field:
||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656| 48860170| 43532656|
| 2| 10328824| 13979539| 16200377|
| 3| 2682453| 5032450| 6532755|
| 4| 836109| 2471794| 3134437|
| 5| 262696| 1324704| 1718862|
| 6| 86487| 755797| 990563|
| 7| 29276| 442974| 571996|
| 8| 11257| 263874| 339382|
| 9| 4627| 161402| 205662|
|10| 2060| 102198| 128034|
|11| 979| 63955| 79531|
|12| 386| 39377| 48805|
|13| 170| 24321| 30113|
|14| 65| 14686| 18437|
|15| 10| 9055| 10918|
|16| 2| 5229| 6821|
|17| 0| 2669| 3595|
|18| 0| 1312| 1897|
|19| 0| 696| 914|
|20| 0| 209| 509|
|21| 0| 44| 148|
|22| 0| 4| 38|
|23| 0| 0| 8|
|24| 0| 0| 1|
|...|0|0|0|
|tot|57778057|73556459|73556459|
So 66.4% of terms have df==1, and 78.5% have df==ttf. Using the following estimation, the old size for (df+ttf) here is 148.7MB. When we steal one bit to mark whether df==ttf, it is reduced to 91.38MB. When we use df==0 to mark df==ttf==1, wow, it is reduced to 70.31MB, thanks Robert!
{noformat}
old_size = col[2] * vIntByteSize(rownumber) + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * vIntByteSize(rownumber)
opt_size = col[2] * vIntByteSize(rownumber) + (rownumber == 1 ? 0 : col[3] * vIntByteSize(rownumber))
{noformat}
By the way, I am quite tempted to omit frq blocks in Lucene41PostingsReader. When we know that df==ttf, we can be sure the in-doc freq is always 1. So, for example, when the bit width ranges from 2 to 8 (inclusive), df is not large enough to create ForBlocks, so we have to vInt-encode each in-doc freq.
For this 'body' field, --I think the index size we can reduce is about 67.5MB (here I only consider vInt block, since 1-bit ForBlock is usually small)-- (ah, I forgot we already steal a bit for this case in Lucene41PBF.) I'll test this later.
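The estimation above can be sketched in Python from the histogram (an illustration, not project code; `vint_byte_size` and the row tuples are taken from the table and formulas in this comment — note the df==0 figure is later corrected in this thread to ~105MB):

```python
import math

def vint_byte_size(bit_width):
    # A vInt carries 7 payload bits per byte.
    return max(1, math.ceil(bit_width / 7))

# (bit width, #(df==ttf), #df, #ttf) rows from the checkindex summary above
ROWS = [
    (1, 43532656, 48860170, 43532656), (2, 10328824, 13979539, 16200377),
    (3, 2682453, 5032450, 6532755), (4, 836109, 2471794, 3134437),
    (5, 262696, 1324704, 1718862), (6, 86487, 755797, 990563),
    (7, 29276, 442974, 571996), (8, 11257, 263874, 339382),
    (9, 4627, 161402, 205662), (10, 2060, 102198, 128034),
    (11, 979, 63955, 79531), (12, 386, 39377, 48805),
    (13, 170, 24321, 30113), (14, 65, 14686, 18437),
    (15, 10, 9055, 10918), (16, 2, 5229, 6821),
    (17, 0, 2669, 3595), (18, 0, 1312, 1897),
    (19, 0, 696, 914), (20, 0, 209, 509),
    (21, 0, 44, 148), (22, 0, 4, 38),
    (23, 0, 0, 8), (24, 0, 0, 1),
]

def estimate():
    old = new = opt = 0
    for bits, n_eq, n_df, n_ttf in ROWS:
        # baseline: df and ttf each written as a plain vInt
        old += n_df * vint_byte_size(bits) + n_ttf * vint_byte_size(bits)
        # steal one bit from df to flag df==ttf; skip ttf when they are equal
        new += n_df * vint_byte_size(bits + 1) + (n_ttf - n_eq) * vint_byte_size(bits)
        # write df==0 to mean df==ttf==1; the ttf cost for the 1-bit row drops out
        opt += n_df * vint_byte_size(bits) + (0 if bits == 1 else n_ttf * vint_byte_size(bits))
    return old, new, opt

old, new, opt = estimate()
# old ≈ 148.7MB and new ≈ 91.38MB match the comment; opt ≈ 105.2MB matches
# the corrected df==0 figure given later in the thread (not the 70.31MB here).
```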
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Han Jiang updated LUCENE-3069:
--
Attachment: df-ttf-estimate.txt

Uploaded detailed data for wikimediumall. Oh, sorry, there is an error in how I calculated the index size for the df==0 trick: it should be 105MB instead of 70MB. But the real test still doesn't match the estimation (weird...). The df==0 trick gains similar compression. Index sizes are below:
{noformat}
v0:                  13195304
v1 = v0 + flag byte: 12847172
v2 = v1 + steal bit: 12770700
v3 = v1 + zero df:   12780884
{noformat}
Another thing that surprised me: with the same code/conf, luceneutil creates different index sizes? I tested the df==0 trick several times on wikimedium1m, and the index size varies from 514M~522M... Does multi-threading affect this much?

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707780#comment-13707780 ] Han Jiang edited comment on LUCENE-3069 at 7/13/13 4:48 PM:
Uploaded detailed data for wikimediumall. Oh, sorry, there is an error in how I calculated the index size for the df==0 trick: it should be 105MB instead of 70MB. But the real test still doesn't match the estimation (weird...). The df==0 trick gains similar compression. Index sizes are below (KB):
{noformat}
v0:                  13195304
v1 = v0 + flag byte: 12847172
v2 = v1 + steal bit: 12770700
v3 = v1 + zero df:   12780884
{noformat}
Another thing that surprised me: with the same code/conf, luceneutil creates different index sizes? I tested the df==0 trick several times on wikimedium1m, and the index size varies from 514M~522M... Does multi-threading affect this much?
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Han Jiang updated LUCENE-3069:
--
Attachment: example.png
LUCENE-3069.patch

Uploaded the patch; it is the main part of the changes I committed to branch3069. The picture shows the current impl of the outputs (fetched from one field in wikimedium5k):
* long[] (sortable metadata)
* byte[] (unsortable, generic metadata)
* df, ttf (term stats)
A single flag byte is used to indicate which of these fields the current outputs maintain; for a PBF with a short byte[], this should be enough. Also, for long-tail terms, the totalTermFreq can safely be inlined into docFreq (for the body field in wikimedium1m, 85.8% of terms have df == ttf). Since TermsEnum is totally based on FSTEnum, the performance of the term dict should be similar to MemoryPF. However, for PK tasks we have to pull the docsEnum from MMap, so this hurts. Following is the performance comparison:
{noformat}
pure TempFST vs. Lucene41 + Memory(on idField), on wikimediumall
Task  QPS base  StdDev  QPS comp  StdDev  Pct diff
Respell 48.13 (4.4%) 15.38 (1.0%) -68.0% ( -70% - -65%)
Fuzzy2 51.30 (5.3%) 17.47 (1.3%) -65.9% ( -68% - -62%)
Fuzzy1 52.24 (4.0%) 18.50 (1.2%) -64.6% ( -67% - -61%)
Wildcard 9.31 (1.7%) 6.16 (2.2%) -33.8% ( -37% - -30%)
Prefix3 23.25 (1.8%) 19.00 (2.2%) -18.3% ( -21% - -14%)
PKLookup 244.92 (3.6%) 225.42 (2.3%) -8.0% ( -13% - -2%)
LowTerm 295.88 (5.5%) 293.27 (4.8%) -0.9% ( -10% - 9%)
HighPhrase 13.62 (6.5%) 13.54 (7.4%) -0.6% ( -13% - 14%)
MedTerm 99.51 (7.8%) 99.19 (7.7%) -0.3% ( -14% - 16%)
MedPhrase 154.63 (9.4%) 154.38 (10.1%) -0.2% ( -17% - 21%)
HighTerm 28.25 (10.7%) 28.25 (10.0%) -0.0% ( -18% - 23%)
OrHighHigh 16.83 (13.3%) 16.86 (13.1%) 0.2% ( -23% - 30%)
HighSloppyPhrase 9.02 (4.4%) 9.03 (4.5%) 0.2% ( -8% - 9%)
LowPhrase 6.26 (3.4%) 6.27 (4.1%) 0.2% ( -7% - 8%)
OrHighMed 13.73 (13.2%) 13.77 (12.8%) 0.3% ( -22% - 30%)
OrHighLow 25.65 (13.2%) 25.73 (13.0%) 0.3% ( -22% - 30%)
MedSloppyPhrase 6.63 (2.7%) 6.66 (2.7%) 0.5% ( -4% - 6%)
AndHighMed 42.77 (1.8%) 43.13 (1.5%) 0.8% ( -2% - 4%)
LowSloppyPhrase 32.68 (3.0%) 32.96 (2.8%) 0.8% ( -4% - 6%)
AndHighHigh 22.90 (1.2%) 23.18 (0.7%) 1.2% ( 0% - 3%)
LowSpanNear 29.30 (2.0%) 29.83 (2.2%) 1.8% ( -2% - 6%)
MedSpanNear 8.39 (2.7%) 8.56 (2.9%) 2.0% ( -3% - 7%)
IntNRQ 3.12 (1.9%) 3.18 (6.7%) 2.1% ( -6% - 10%)
AndHighLow 507.01 (2.4%) 522.10 (2.8%) 3.0% ( -2% - 8%)
HighSpanNear 5.43 (1.8%) 5.60 (2.6%) 3.1% ( -1% - 7%)
{noformat}
{noformat}
pure TempFST vs. pure Lucene41, on wikimediumall
Task  QPS base  StdDev  QPS comp  StdDev  Pct diff
Respell 49.24 (2.7%) 15.51 (1.0%) -68.5% ( -70% - -66%)
Fuzzy2 52.01 (4.8%) 17.61 (1.4%) -66.1% ( -68% - -63%)
Fuzzy1 53.00 (4.0%) 18.62 (1.3%) -64.9% ( -67% - -62%)
Wildcard 9.37 (1.3%) 6.15 (2.1%) -34.4% ( -37% - -31%)
Prefix3 23.36 (0.8%) 18.96 (2.1%) -18.8% ( -21% - -16%)
MedPhrase 155.86 (9.8%) 152.34 (9.7%) -2.3% ( -19% - 19%)
LowPhrase 6.33 (3.7%) 6.23 (4.0%) -1.6% ( -8% - 6%)
HighPhrase 13.68 (7.2%) 13.49 (6.8%) -1.4% ( -14% - 13%)
OrHighMed 13.78 (13.0%) 13.68 (12.7%) -0.8% ( -23% - 28%)
HighSloppyPhrase 9.14 (5.2%) 9.07 (3.7%) -0.7% ( -9% - 8%)
OrHighHigh 16.87 (13.3%) 16.76 (12.9%) -0.6% ( -23% - 29%)
OrHighLow 25.71 (13.1%) 25.58 (12.8%) -0.5% ( -23% - 29
{noformat}
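The flag-byte idea described above can be sketched as follows (a hypothetical layout for illustration only — the flag-bit names and part ordering are assumptions, not the patch's actual on-disk format):

```python
# Illustrative flag bits; not the patch's actual constants.
HAS_LONGS, HAS_BYTES, HAS_STATS = 1, 2, 4

def encode_term_output(longs, generic_bytes, df, ttf):
    """Pack one term's FST output: a single flag byte records which parts
    follow, and for long-tail terms (df == ttf == 1) the stats are omitted
    entirely, so the common case costs only the flag byte plus metadata."""
    flag = 0
    parts = []
    if longs:
        flag |= HAS_LONGS
        parts.append(("longs", list(longs)))
    if generic_bytes:
        flag |= HAS_BYTES
        parts.append(("bytes", bytes(generic_bytes)))
    if not (df == 1 and ttf == 1):
        flag |= HAS_STATS
        parts.append(("stats", (df, ttf)))
    return flag, parts

# A long-tail term: stats bit is absent because df == ttf == 1.
flag, parts = encode_term_output([42, 7], b"", df=1, ttf=1)
```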
[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707649#comment-13707649 ] Han Jiang commented on LUCENE-3069:
---
bq. Cool idea! I wonder how many of those are df == ttf == 1?
I didn't try a very precise estimation, but the percentage will be large. For the index of wikimedium1m, the largest segment has a 'body' field with:
{noformat}
bitwidth/7   df==ttf / df
1    1324400 / 1542987
2        110 / 18951
3          0 / 175
4          0 / 0
5          0 / 0
{noformat}
That is where the 85.8% comes from. 'bitwidth/7' means ceil(bitwidth of df / 7), since we're using vInt encoding. So, for this field, we can save (1324400 + 110*2) bytes by stealing one bit.
bq. Maybe we could try writing a vInt of 0 for docFreq to indicate that both docFreq and totalTermFreq are 1?
Yes, that may help! I'll try to test the percentage. But we should still note that df is a small part of the term dict data.
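The two tricks under discussion can be sketched as value-level encoders (purely illustrative — this is not Lucene's actual on-disk layout; storing ttf as a delta against df is just one plausible arrangement):

```python
def write_stats_zero_df(df, ttf):
    """The 'df==0' trick: the long-tail case df==ttf==1 is written as a single
    0; otherwise df is written as-is, followed by ttf-df (0 means df==ttf).
    Real df is always >= 1, so the value 0 is free to act as a marker."""
    if df == 1 and ttf == 1:
        return [0]
    return [df, ttf - df]

def read_stats_zero_df(values):
    if values[0] == 0:
        return 1, 1
    df = values[0]
    return df, df + values[1]

def write_stats_steal_bit(df, ttf):
    """The 'steal one bit' trick: df is shifted left one bit and the low bit
    flags df==ttf; ttf-df follows only when the two differ."""
    flagged = (df << 1) | (1 if df == ttf else 0)
    return [flagged] if df == ttf else [flagged, ttf - df]

def read_stats_steal_bit(values):
    df = values[0] >> 1
    if values[0] & 1:
        return df, df
    return df, df + values[1]
```

Both schemes drop the second vInt exactly when df==ttf, which is why the savings track the #(df==ttf) column in the tables above.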
[jira] [Updated] (LUCENE-5029) factor out a generic 'TermState' for better sharing in FST-based term dict
[ https://issues.apache.org/jira/browse/LUCENE-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Han Jiang updated LUCENE-5029:
--
Attachment: LUCENE-5029.patch

This patch keeps the original 'customize termstate in PBF' design. It also pushes flushTermsBlock/readTermsBlock to the term dict side. Now the rule is: if your PBF has some monotonic but 'don't care' values, always fill them with -1, so that the term dict will reuse previous values to 'pad' those -1s. Yes Mike, the algebra is really simple :) But I still have a problem removing that termBlockOrd from BlockTermState: every time a caller uses seekExact(), it is expected to get a new term state in which 'termBlockOrd' is involved. However I cannot fully understand how this variable works; maybe we can use metadataUpto to replace it? I'll try this later. Can you put the TestDrillSideway fix in the lucene3069 branch as well? Thanks :)

factor out a generic 'TermState' for better sharing in FST-based term dict
--
Key: LUCENE-5029
URL: https://issues.apache.org/jira/browse/LUCENE-5029
Project: Lucene - Core
Issue Type: Sub-task
Reporter: Han Jiang
Assignee: Han Jiang
Priority: Minor
Fix For: 4.4
Attachments: LUCENE-5029.algebra.patch, LUCENE-5029.algebra.patch, LUCENE-5029.branch-init.patch, LUCENE-5029.patch, LUCENE-5029.patch, LUCENE-5029.patch, LUCENE-5029.patch, LUCENE-5029.patch

Currently, those two FST-based term dicts (memory codec, blocktree) both use FST<BytesRef> as a base data structure; this might not share much data in parent arcs, since the encoded BytesRef doesn't guarantee that 'Outputs.common()' always creates a long prefix. While for the current postings format, it is guaranteed that each FP (pointing to .doc, .pos, etc.) will increase monotonically with 'larger' terms. That means, between two Outputs, the Outputs from the smaller term can be safely pushed towards the root.
However we always have some tricky TermState to deal with (like the singletonDocID for the pulsing trick), so as Mike suggested, we can simply cut the whole TermState into two parts: one part for comparison and intersection, another for restoring generic data. Then the data structure will be clear: this generic 'TermState' will consist of a fixed-length LongsRef and a variable-length BytesRef.
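The '-1 padding' rule from the comment above can be sketched like this (assumed semantics reconstructed from the description; the function name is illustrative, not from the patch):

```python
def pad_metadata(rows):
    """Each row is one term's long[] metadata, in term order. A -1 entry means
    'don't care': the term dict fills it with the previous term's value for
    that slot, so each column stays monotonic and delta-encodes well.
    Assumes the first row contains no -1 entries."""
    padded_rows = []
    prev = None
    for row in rows:
        padded = [prev[i] if v == -1 and prev is not None else v
                  for i, v in enumerate(row)]
        padded_rows.append(padded)
        prev = padded
    return padded_rows
```

For example, a -1 in the second term's first slot is replaced by the first term's value for that slot, preserving monotonicity without the PBF having to track it.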
[jira] [Resolved] (LUCENE-5029) factor out a generic 'TermState' for better sharing in FST-based term dict
[ https://issues.apache.org/jira/browse/LUCENE-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Han Jiang resolved LUCENE-5029.
---
Resolution: Fixed

PostingsBase is now pluggable for the term dict, and the introduction of long[] and byte[] naturally helps the delta-encoding in both the block-based term dict and the FST-based term dict.