[ 
https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641128#action_12641128
 ] 

Eks Dev commented on LUCENE-1426:
---------------------------------

Just a few random thoughts on this topic

- I am sure I read somewhere in the PDFs that were floating around that it 
would make sense to use VInts for very short postings and PFOR for the rest. I 
just do not remember the rationale behind it.

- During the omitTf() discussion, we came up with a cool idea: inline very 
short postings directly into the term dict instead of storing an offset. This 
way we spare one seek per term in many cases, as well as the space for storing 
the offset. I do not know if this would cause problems elsewhere, but it sounds 
reasonable. With a standard Zipfian distribution, a lot of postings should get 
inlined. Use cases where we have query expansion on many terms (think spell 
checker, synonyms ...) should benefit heavily from that. These postings are 
small, but there are a lot of them, so it adds up... seek is deadly :)
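
To make the two ideas above a bit more concrete, here is a rough sketch in 
Java. It is not taken from any patch on this issue: the class name, the 
INLINE_MAX_DOCS cutoff and the helper methods are all made up for 
illustration. It encodes doc ID deltas as VInts (7 bits per byte, low-order 
bits first, high bit set when more bytes follow) and uses a simple docFreq 
threshold to decide whether a posting list is short enough to inline.

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch only; none of these names exist in Lucene.
final class InlinePostingsSketch {

    // Hypothetical cutoff: postings with at most this many docs get inlined
    // into the term dict entry instead of being referenced by a file offset.
    static final int INLINE_MAX_DOCS = 4;

    // Writes a non-negative int as a VInt: 7 bits per byte, low-order bits
    // first, with the high bit set on every byte except the last.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Encodes ascending doc IDs as VInt-coded deltas from the previous doc ID.
    static byte[] encodeDocDeltas(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int docId : sortedDocIds) {
            writeVInt(out, docId - previous);
            previous = docId;
        }
        return out.toByteArray();
    }

    // Assumed policy: inline when the term occurs in only a few documents.
    static boolean shouldInline(int docFreq) {
        return docFreq <= INLINE_MAX_DOCS;
    }
}
```

A real cutoff would have to be tuned against actual indexes; the value 4 is 
only a placeholder. Longer lists would go to a block codec such as PFOR, 
which is not shown here.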

I am sorry to miss the party here with PFOR, but let us hope this credit crunch 
is over soon so that I can dedicate some time to fun things like this :)

cheers, eks 


  

> Next steps towards flexible indexing
> ------------------------------------
>
>                 Key: LUCENE-1426
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1426
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1426.patch
>
>
> In working on LUCENE-1410 (PFOR compression) I tried to prototype
> switching the postings files to use PFOR instead of vInts for
> encoding.
> But it quickly became difficult.  EG we currently mux the skip data
> into the .frq file, which messes up the int blocks.  We inline
> payloads with positions which would also mess up the int blocks.
> Skipping offsets and TermInfo offsets hardwire the file pointers of
> frq & prox files yet I need to change these to block + offset, etc.
> Separately this thread also started up, on how to customize how Lucene
> stores positional information in the index:
>   http://www.gossamer-threads.com/lists/lucene/java-user/66264
> So I decided to make a bit more progress towards "flexible indexing"
> by first modularizing/isolating the classes that actually write the
> index format.  The idea is to capture the logic of each (terms, freq,
> positions/payloads) into separate interfaces and switch the flushing
> of a new segment as well as writing the segment during merging to use
> the same APIs.


