[ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641137#action_12641137 ]
Michael McCandless commented on LUCENE-1426:
--------------------------------------------

bq. During omitTf() discussion, we came up with cool idea to actually inline very short postings into term dict instead of storing offset.

Yes, there's this issue:

    https://issues.apache.org/jira/browse/LUCENE-1278

And you had found this one:

    http://www.siam.org/proceedings/alenex/2008/alx08_01transierf.pdf

And then Doug referenced this:

    http://citeseer.ist.psu.edu/cutting90optimizations.html

I think the idea makes tons of sense (saving a seek), and one of my goals in phase 2 (genericizing the reading of an index) is to make pulsing a drop-in codec as an example & litmus test. Terms iteration may suffer, though, unless we put this in a separate file.

I also think, at the opposite end of the spectrum, it would make sense for very common terms to use simple n-bit packing (PFOR minus the exceptions). For massive terms we need the fastest search we can get, since that gates when you have to start sharding.

bq. I am sorry to miss the party here with PFOR, but let us hope this credit crunch gets over soon so that I could dedicate some time to fun things like this

Well, the stock market seems to think the credit crunch is improving, today... of course who knows what'll happen tomorrow! Good luck :)

Also, I'd like to explore improving the terms dict indexing -- I don't think we need to load a TermInfo instance into RAM for every indexed term. I think we just need the term & seek data (into the tis file); then you seek there and skip to the TermInfo you need. This should save a good amount of RAM for large indices with odd terms, since each TermInfo instance requires a pointer to it (4 or 8 bytes), an object header (8 bytes at least), then 20 bytes for the members.

All these explorations should become simple drop-in codecs, once I can finish phase 2.
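To make the "simple n-bit packing (PFOR minus the exceptions)" idea concrete, here is a minimal sketch in Java. It is an illustration under stated assumptions, not code from Lucene or from the LUCENE-1410 patch: the class name NBitPacking and its methods are hypothetical, the block is packed into a long[] rather than written to an index file, and every value in the block is assumed to be non-negative and to fit in the chosen bit width (there is no exception handling for outliers, which is the part of PFOR being left out).

```java
/** Hypothetical illustration of fixed-width bit packing; not a Lucene API. */
class NBitPacking {

    /** Smallest width (at least 1 bit) that can hold every value in the block. */
    static int bitsRequired(int[] values) {
        int orOfAll = 0;
        for (int v : values) {
            if (v < 0) {
                throw new IllegalArgumentException("negative value: " + v);
            }
            orOfAll |= v; // highest set bit of the OR equals highest set bit of the max
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(orOfAll));
    }

    /** Packs values back to back, 'bits' bits each. Caller must pass bits >= bitsRequired(values). */
    static long[] pack(int[] values, int bits) {
        long[] packed = new long[(int) (((long) values.length * bits + 63) / 64)];
        for (int i = 0; i < values.length; i++) {
            long bitPos = (long) i * bits;
            int word = (int) (bitPos >>> 6);   // which long holds the first bit
            int shift = (int) (bitPos & 63);   // bit offset inside that long
            long v = values[i] & 0xFFFFFFFFL;
            packed[word] |= v << shift;
            if (shift + bits > 64) {           // value straddles two longs
                packed[word + 1] |= v >>> (64 - shift);
            }
        }
        return packed;
    }

    /** Inverse of pack: reads 'count' values of 'bits' bits each. */
    static int[] unpack(long[] packed, int count, int bits) {
        int[] values = new int[count];
        long mask = (1L << bits) - 1;          // bits is at most 32, so this cannot overflow
        for (int i = 0; i < count; i++) {
            long bitPos = (long) i * bits;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = packed[word] >>> shift;
            if (shift + bits > 64) {
                v |= packed[word + 1] << (64 - shift);
            }
            values[i] = (int) (v & mask);
        }
        return values;
    }
}
```

For example, a block of doc deltas such as {3, 1, 7, 2, 5, 1, 1, 4} needs 3 bits per value, so the eight values fit in a single long, and decoding needs no per-value branching on a continuation bit the way vInt decoding does.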
> Next steps towards flexible indexing
> ------------------------------------
>
>                 Key: LUCENE-1426
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1426
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1426.patch
>
>
> In working on LUCENE-1410 (PFOR compression) I tried to prototype
> switching the postings files to use PFOR instead of vInts for
> encoding.
>
> But it quickly became difficult. EG we currently mux the skip data
> into the .frq file, which messes up the int blocks. We inline
> payloads with positions which would also mess up the int blocks.
> Skipping offsets and TermInfo offsets hardwire the file pointers of
> frq & prox files yet I need to change these to block + offset, etc.
>
> Separately this thread also started up, on how to customize how Lucene
> stores positional information in the index:
>
>   http://www.gossamer-threads.com/lists/lucene/java-user/66264
>
> So I decided to make a bit more progress towards "flexible indexing"
> by first modularizing/isolating the classes that actually write the
> index format. The idea is to capture the logic of each (terms, freq,
> positions/payloads) into separate interfaces and switch the flushing
> of a new segment as well as writing the segment during merging to use
> the same APIs.
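
The issue description quoted above proposes capturing the terms, freq, and positions/payloads writing logic in separate interfaces, with segment flush and merge both driving the same API. A minimal sketch of what that layering could look like follows. All names here (TermsSink, DocsSink, PositionsSink, RecordingSink, PostingsDriver) are hypothetical and chosen for illustration; they are not taken from LUCENE-1426.patch, and the toy sink only records calls in memory instead of writing index files.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical: receives positions (and optional payloads) for one document. */
interface PositionsSink {
    void addPosition(int position, byte[] payload); // payload may be null
}

/** Hypothetical: receives the documents (and term frequencies) for one term. */
interface DocsSink {
    PositionsSink startDoc(int docID, int termFreq);
}

/** Hypothetical: receives the terms of a field in sorted order. */
interface TermsSink {
    DocsSink startTerm(String term);
}

/** Toy sink that logs each call as text so the call order can be inspected. */
class RecordingSink implements TermsSink, DocsSink, PositionsSink {
    final List<String> log = new ArrayList<>();

    @Override
    public DocsSink startTerm(String term) {
        log.add("term " + term);
        return this;
    }

    @Override
    public PositionsSink startDoc(int docID, int termFreq) {
        log.add("doc " + docID + " freq " + termFreq);
        return this;
    }

    @Override
    public void addPosition(int position, byte[] payload) {
        log.add("pos " + position);
    }
}

/** Stand-in for the caller: a flush and a merge would both go through this same path. */
class PostingsDriver {
    static void writeTerm(TermsSink sink, String term, int[] docIDs, int[][] positions) {
        DocsSink docs = sink.startTerm(term);
        for (int i = 0; i < docIDs.length; i++) {
            PositionsSink pos = docs.startDoc(docIDs[i], positions[i].length);
            for (int p : positions[i]) {
                pos.addPosition(p, null); // no payloads in this toy example
            }
        }
    }
}
```

The point of the split is that an alternative encoding (for example block-based PFOR for the freq stream) would only need to supply a different implementation of the relevant interface, while the code that walks postings during flush or merge stays unchanged.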