[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

Michael McCandless (JIRA) Mon, 20 Oct 2008 12:58:16 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641137#action_12641137
 ]


Michael McCandless commented on LUCENE-1426:
--------------------------------------------

bq. During omitTf() discussion, we came up with cool idea to actually inline 
very short postings into term dict instead of storing offset.

Yes, there's this issue:

  https://issues.apache.org/jira/browse/LUCENE-1278

And you had found this one:

  http://www.siam.org/proceedings/alenex/2008/alx08_01transierf.pdf

And then Doug referenced this:

  http://citeseer.ist.psu.edu/cutting90optimizations.html

I think the idea makes tons of sense (saving a seek) and one of my
goals in phase 2 (genericizing the reading of an index) is to make
pulsing a drop-in codec as an example & litmus test.  Terms iteration
may suffer, though, unless we put this in a separate file.
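
To make the saved seek concrete, here is a minimal sketch of the inlining idea, not the actual codec. All names (PulsingSketch, FakeFreqFile, TermEntry, INLINE_MAX_DOCS) are hypothetical, the cutoff of 2 docs is an arbitrary assumption, and the "freq file" is an in-memory stand-in that just counts reads:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PulsingSketch {
    // Hypothetical cutoff: terms with at most this many docs are inlined.
    static final int INLINE_MAX_DOCS = 2;

    /** In-memory stand-in for the .frq file; counts reads so the saved seek is visible. */
    static final class FakeFreqFile {
        private final List<int[]> postings = new ArrayList<int[]>();
        int seeks = 0;

        long append(int[] docs) {
            postings.add(docs.clone());
            return postings.size() - 1; // fake "file pointer"
        }

        int[] readAt(long pointer) {
            seeks++;
            return postings.get((int) pointer).clone();
        }
    }

    /** Terms dict entry: either inlined doc IDs or a pointer into the freq file. */
    static final class TermEntry {
        final int[] inlineDocs; // null when the postings live in the freq file
        final long freqPointer; // -1 when the postings are inlined

        TermEntry(int[] inlineDocs, long freqPointer) {
            this.inlineDocs = inlineDocs;
            this.freqPointer = freqPointer;
        }
    }

    /** Short postings go into the entry itself; longer ones go to the freq file. */
    static TermEntry write(int[] docs, FakeFreqFile frq) {
        if (docs.length <= INLINE_MAX_DOCS) {
            return new TermEntry(docs.clone(), -1L);
        }
        return new TermEntry(null, frq.append(docs));
    }

    /** Inlined postings are answered from the entry, with no freq-file read. */
    static int[] read(TermEntry entry, FakeFreqFile frq) {
        if (entry.inlineDocs != null) {
            return entry.inlineDocs.clone();
        }
        return frq.readAt(entry.freqPointer);
    }

    public static void main(String[] args) {
        FakeFreqFile frq = new FakeFreqFile();
        TermEntry rare = write(new int[] {42}, frq);
        TermEntry common = write(new int[] {1, 5, 9, 12}, frq);
        System.out.println(Arrays.toString(read(rare, frq)) + " seeks=" + frq.seeks);
        System.out.println(Arrays.toString(read(common, frq)) + " seeks=" + frq.seeks);
    }
}
```

The inlined entries are also what makes each terms dict record bigger, which is why iterating over terms could get slower if they share a file with the terms.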

I also think, at the opposite end of the spectrum, it would make sense
for very common terms to use simple n-bit packing (PFOR minus the
exceptions).  For massive terms we need the fastest search we can
get, since that gates when you have to start sharding.
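
As an illustration of what simple n-bit packing means here, the sketch below stores every value of a block in the same fixed number of bits, with no exception list. It is not the LUCENE-1410 code; the class and method names are made up, and choosing the width from the largest value in the block is my assumption about how the bit count would be picked:

```java
import java.util.Arrays;

final class NBitPacking {
    /** Smallest width (at least 1) that holds every value; values must be non-negative. */
    static int bitsRequired(int[] values) {
        int max = 0;
        for (int v : values) {
            if (v < 0) throw new IllegalArgumentException("negative value: " + v);
            max = Math.max(max, v);
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
    }

    /** Packs each value into exactly `bits` bits, back to back, in a long[]. */
    static long[] pack(int[] values, int bits) {
        if (bits < 1 || bits > 32) throw new IllegalArgumentException("bits must be in 1..32");
        long mask = (1L << bits) - 1;
        long[] out = new long[(int) (((long) values.length * bits + 63) / 64)];
        for (int i = 0; i < values.length; i++) {
            long v = values[i] & 0xFFFFFFFFL;
            if ((v & ~mask) != 0) throw new IllegalArgumentException("value does not fit: " + values[i]);
            long bitPos = (long) i * bits;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            out[word] |= v << shift;
            if (shift + bits > 64) {
                // Value straddles two words: spill the high bits into the next one.
                out[word + 1] |= v >>> (64 - shift);
            }
        }
        return out;
    }

    /** Reverses pack(); `count` and `bits` must match what was used to pack. */
    static int[] unpack(long[] packed, int count, int bits) {
        if (bits < 1 || bits > 32) throw new IllegalArgumentException("bits must be in 1..32");
        long mask = (1L << bits) - 1;
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            long bitPos = (long) i * bits;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = packed[word] >>> shift;
            if (shift + bits > 64) {
                v |= packed[word + 1] << (64 - shift);
            }
            out[i] = (int) (v & mask);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] docDeltas = {3, 1, 7, 2, 5, 1, 4, 6}; // hypothetical doc ID gaps
        int bits = bitsRequired(docDeltas);
        long[] packed = pack(docDeltas, bits);
        System.out.println(bits + " bits/value, " + packed.length + " long(s)");
        System.out.println(Arrays.toString(unpack(packed, docDeltas.length, bits)));
    }
}
```

Decoding is a fixed shift and mask per value with no per-value branching on the data, which is the property that should matter for very long postings lists.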

bq. I am sorry to miss the party here with PFOR, but let us hope this credit 
crunch gets over soon so that I could dedicate some time to fun things like 
this

Well the stock market seems to think the credit crunch is improving,
today... of course who knows what'll happen tomorrow!  Good luck :)

Also, I'd like to explore improving the terms dict indexing -- I don't
think we need to load a TermInfo instance for every indexed term, into
RAM.  I think we just need the term & seek data (into the tis file),
then you seek there and skip to the TermInfo you need.  This should
save a good amount of RAM for large indices with odd terms, since each
TermInfo instance requires a pointer to it (4 or 8 bytes), an object
header (8 bytes at least) then 20 bytes for the members.
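
For a rough sense of scale, the per-instance numbers above add up as in this small sketch. The byte counts are the ones quoted in this comment; the class name and the indexed-term count in main are purely hypothetical, not measurements:

```java
final class TermInfoRamEstimate {
    static final int OBJECT_HEADER_BYTES = 8; // "8 bytes at least"
    static final int MEMBER_BYTES = 20;       // TermInfo members, as quoted above

    /** Per-instance cost: reference to the object + object header + members. */
    static long bytesPerTermInfo(int pointerBytes) {
        if (pointerBytes != 4 && pointerBytes != 8) {
            throw new IllegalArgumentException("pointer is 4 or 8 bytes");
        }
        return pointerBytes + OBJECT_HEADER_BYTES + MEMBER_BYTES;
    }

    /** Total RAM spent on TermInfo instances for the given number of indexed terms. */
    static long totalBytes(long indexedTerms, int pointerBytes) {
        return indexedTerms * bytesPerTermInfo(pointerBytes);
    }

    public static void main(String[] args) {
        long indexedTerms = 1000000L; // hypothetical count of terms held in the in-RAM index
        System.out.println("32-bit JVM: " + totalBytes(indexedTerms, 4) + " bytes");
        System.out.println("64-bit JVM: " + totalBytes(indexedTerms, 8) + " bytes");
    }
}
```

That is 32 or 36 bytes per indexed term that would not need to be held in RAM if only the term and its seek position into the tis file were kept.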

All these explorations should become simple drop-in codecs, once I can
finish phase 2.


> Next steps towards flexible indexing
> ------------------------------------
>
>                 Key: LUCENE-1426
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1426
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1426.patch
>
>
> In working on LUCENE-1410 (PFOR compression) I tried to prototype
> switching the postings files to use PFOR instead of vInts for
> encoding.
> But it quickly became difficult.  EG we currently mux the skip data
> into the .frq file, which messes up the int blocks.  We inline
> payloads with positions which would also mess up the int blocks.
> Skipping offsets and TermInfo offsets hardwire the file pointers of
> frq & prox files yet I need to change these to block + offset, etc.
> Separately this thread also started up, on how to customize how Lucene
> stores positional information in the index:
>   http://www.gossamer-threads.com/lists/lucene/java-user/66264
> So I decided to make a bit more progress towards "flexible indexing"
> by first modularizing/isolating the classes that actually write the
> index format.  The idea is to capture the logic of each (terms, freq,
> positions/payloads) into separate interfaces and switch the flushing
> of a new segment as well as writing the segment during merging to use
> the same APIs.


