[ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1426:
---------------------------------------

    Attachment: LUCENE-1426.patch

Attached patch.  I think it's ready to commit... I'll wait a few days.

This factors the writing of postings into separate Format* classes.
The approach I took is similar to what I did for DocumentsWriter:
there is a hierarchical consumer interface (abstract class) for
writing each of fields, terms, docs, and positions.  Then there's a
corresponding set of concrete classes (the "codec chain") that write
today's index format.  There is no change to the index format.
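
As a rough illustration of the hierarchy described above, here is a
minimal sketch.  All class and method names are hypothetical (the
patch's classes are package private and are not reproduced here), and
the concrete "chain" records events in a list instead of writing
today's index format.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; NOT the classes from the patch.
// One abstract consumer per level: fields -> terms -> docs -> positions.
abstract class FieldsConsumer {
    abstract TermsConsumer addField(String fieldName);
    abstract void finish();
}

abstract class TermsConsumer {
    abstract DocsConsumer startTerm(String termText);
    abstract void finishTerm(int docFreq);
}

abstract class DocsConsumer {
    abstract PositionsConsumer addDoc(int docID, int termFreq);
}

abstract class PositionsConsumer {
    abstract void addPosition(int position);
}

// Toy concrete chain: logs each call rather than writing index files.
class LoggingFieldsConsumer extends FieldsConsumer {
    final List<String> events = new ArrayList<>();

    @Override
    TermsConsumer addField(String fieldName) {
        events.add("field " + fieldName);
        return new TermsConsumer() {
            @Override
            DocsConsumer startTerm(String termText) {
                events.add("term " + termText);
                return new DocsConsumer() {
                    @Override
                    PositionsConsumer addDoc(int docID, int termFreq) {
                        events.add("doc " + docID + " freq " + termFreq);
                        return new PositionsConsumer() {
                            @Override
                            void addPosition(int position) {
                                events.add("pos " + position);
                            }
                        };
                    }
                };
            }

            @Override
            void finishTerm(int docFreq) {
                events.add("endTerm docFreq " + docFreq);
            }
        };
    }

    @Override
    void finish() {
        events.add("finish");
    }
}

class CodecChainSketch {
    // Any producer (a flush or a merge) would drive the chain the same way.
    static List<String> demo() {
        LoggingFieldsConsumer fields = new LoggingFieldsConsumer();
        TermsConsumer terms = fields.addField("body");
        DocsConsumer docs = terms.startTerm("lucene");
        PositionsConsumer positions = docs.addDoc(3, 2);
        positions.addPosition(0);
        positions.addPosition(7);
        terms.finishTerm(1);
        fields.finish();
        return fields.events;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Because each level hands back the consumer for the next level down, a
different chain could be substituted at the top without the caller
changing how it walks fields, terms, docs, and positions.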

Here are the details:

  * This only applies to postings (not stored fields, term vectors,
    norms, field infos)

  * Both SegmentMerger & FreqProxTermsWriter now use the same codec
    API to write postings.  I think this is a big step forward: we now
    have a single set of classes that write the postings.

  * You can't yet customize this codec chain; we can add that at some
    point.  It's all package private.

  * I don't yet allow the codec to override SegmentInfo.files(); at
    some point (when I first try to make a codec that uses different
    files) I will add this.

I ran a quick performance test, indexing wikipedia, and found the
performance cost of this change to be negligible.

The next step, which is trickier, is to modularize/genericize the
classes that read from the index, and then refactor
SegmentTerm{Enum,Docs,Positions} to use that codec API.

Then, finally, I want to make a codec that uses PFOR to encode
postings.

> Next steps towards flexible indexing
> ------------------------------------
>
>                 Key: LUCENE-1426
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1426
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1426.patch
>
>
> In working on LUCENE-1410 (PFOR compression) I tried to prototype
> switching the postings files to use PFOR instead of vInts for
> encoding.
>
> But it quickly became difficult.  E.g., we currently mux the skip data
> into the .frq file, which messes up the int blocks.  We inline
> payloads with positions, which would also mess up the int blocks.
> Skipping offsets and TermInfo offsets hardwire the file pointers of
> frq & prox files, yet I need to change these to block + offset, etc.
>
> Separately this thread also started up, on how to customize how Lucene
> stores positional information in the index:
>   http://www.gossamer-threads.com/lists/lucene/java-user/66264
>
> So I decided to make a bit more progress towards "flexible indexing"
> by first modularizing/isolating the classes that actually write the
> index format.  The idea is to capture the logic of each (terms, freq,
> positions/payloads) into separate interfaces and switch the flushing
> of a new segment as well as writing the segment during merging to use
> the same APIs.
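
For context on the vInt encoding mentioned in the description, here is
a minimal sketch of the scheme: seven data bits per byte, low-order
bits first, with the high bit set when more bytes follow.  The class
and method names are hypothetical and this is not code from the patch.
Each value takes a variable number of bytes, so other data can sit
between any two values, whereas a block-based encoding such as PFOR
needs runs of ints kept together.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper; a sketch of vInt-style encoding, not Lucene's own code.
class VIntSketch {
    // Write 7 bits at a time, low-order first; the high bit marks "more bytes follow".
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Read bytes starting at offset until one arrives without the high bit set.
    static int decode(byte[] bytes, int offset) {
        byte b = bytes[offset++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = bytes[offset++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    public static void main(String[] args) {
        for (int v : new int[] {5, 127, 128, 16384}) {
            System.out.println(v + " -> " + encode(v).length + " byte(s)");
        }
    }
}
```

Small values fit in one byte and larger ones grow by a byte per seven
bits, which is why the encoded length of a postings entry is not fixed.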

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

