[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-1458:
---------------------------------------

    Attachment: LUCENE-1458.patch

OK, I created another codec, SepCodec (for lack of a better name), that stores doc & frq & skip data in 3 separate files (vs 1 for Lucene today), as well as positions & payloads in 2 separate files (vs 1 for Lucene today). The code is still messy -- lots of nocommits all over the place. I'm still iterating.

Finally, this gets us one step closer to using PFOR! With this codec, the .frq, .doc and .prx files are now "pure" streams of ints.

This codec was more interesting because it adds new files to the file format, which required fixing the various interesting places where we assume which file extensions belong to a segment.

In this patch I also created a PostingCodec class (roughly sketched below, after the observations), with 3 subclasses (so far):

* DefaultCodec: new terms dict format, but the same back-compatible frq/prx format
* PulsingCodec: new terms dict format, but inlines rare terms into the terms dict
* SepCodec: new terms dict format, splits doc/frq/skip into 3 separate files, and splits prox/payload into 2 separate files

By editing the PostingCodec.getCodec method you can switch all tests to use each codec; all tests pass with each codec.

I built the 1M-doc Wikipedia index using SepCodec. Here's the ls -l:

{code}
-rw-rw-rw-  1 mike  admin    4000004 Nov 20 17:16 _0.fdt
-rw-rw-rw-  1 mike  admin    8000004 Nov 20 17:16 _0.fdx
-rw-rw-rw-  1 mike  admin  303526787 Nov 20 17:34 _n.doc
-rw-rw-rw-  1 mike  admin         33 Nov 20 17:30 _n.fnm
-rw-rw-rw-  1 mike  admin  220470670 Nov 20 17:34 _n.frq
-rw-rw-rw-  1 mike  admin    3000004 Nov 20 17:34 _n.nrm
-rw-rw-rw-  1 mike  admin  651670377 Nov 20 17:34 _n.prx
-rw-rw-rw-  1 mike  admin          0 Nov 20 17:30 _n.pyl
-rw-rw-rw-  1 mike  admin   84963104 Nov 20 17:34 _n.skp
-rw-rw-rw-  1 mike  admin     666999 Nov 20 17:34 _n.tii
-rw-rw-rw-  1 mike  admin   87551274 Nov 20 17:34 _n.tis
-rw-rw-rw-  1 mike  admin         20 Nov 20 17:34 segments.gen
-rw-rw-rw-  1 mike  admin         64 Nov 20 17:34 segments_2
{code}

Some initial observations for SepCodec:

* Merging/optimizing was noticeably slower. I think there's some lingering inefficiency in my changes, but it could also simply be that having to step through 3 files (.frq, .doc, .prx) instead of 2 (.frq, .prx) for each segment is that much more costly. (With payloads it'd be 4 files instead of 2.)
* Net index size is quite a bit larger (1300 MB vs 1139 MB), I think because we no longer encode the frq=1 case efficiently (see the first sketch below). PFOR should fix that.
* Skip data is just about as large as the terms dict, which surprises me (I had intuitively expected it to be smaller, I guess).
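[Editor's note on the frq=1 observation: in the existing combined .frq format (per the Lucene file-formats documentation), the doc delta is shifted left one bit and the low bit flags freq==1, so the very common freq=1 case writes no freq bytes at all. A pure-int doc stream gives that trick up until something like PFOR wins the space back. A minimal sketch of the trick; the class and writeDocFreq helper names are hypothetical, only IndexOutput.writeVInt is real Lucene API:]

{code}
import java.io.IOException;
import org.apache.lucene.store.IndexOutput;

class FrqEncodingSketch {
  // Combined .frq encoding: docDelta and freq share a single VInt when
  // freq is 1.  Hypothetical helper, for illustration only.
  static void writeDocFreq(IndexOutput frqOut, int docDelta, int freq) throws IOException {
    if (freq == 1) {
      frqOut.writeVInt((docDelta << 1) | 1);  // odd value: freq is 1 and is omitted
    } else {
      frqOut.writeVInt(docDelta << 1);        // even value: freq follows as its own VInt
      frqOut.writeVInt(freq);
    }
  }
}
{code}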
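[Editor's note: here is roughly how the PostingCodec indirection described above might look. Only the names PostingCodec, DefaultCodec, PulsingCodec, SepCodec and getCodec come from the patch description; the writer/reader interfaces and method signatures below are assumptions for illustration, not the patch's actual code:]

{code}
import java.io.IOException;

// Hypothetical stand-ins for the patch's writer/reader chains
// (FormatPostingsTermsDictWriter/Reader and friends).
interface PostingsWriter { void close() throws IOException; }
interface PostingsReader { void close() throws IOException; }

abstract class PostingCodec {
  // Each codec owns both directions: how a segment's postings are written...
  abstract PostingsWriter writer(String segment) throws IOException;
  // ...and how they are read back.
  abstract PostingsReader reader(String segment) throws IOException;

  // Central switch: change the return value here to run the whole test
  // suite against DefaultCodec, PulsingCodec or SepCodec.
  static PostingCodec getCodec() {
    return new DefaultCodec();
  }
}

class DefaultCodec extends PostingCodec {
  PostingsWriter writer(String segment) {
    return new PostingsWriter() { public void close() {} };  // stub
  }
  PostingsReader reader(String segment) {
    return new PostingsReader() { public void close() {} };  // stub
  }
}
{code}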
> Further steps towards flexible indexing
> ---------------------------------------
>
>                 Key: LUCENE-1458
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.9
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback. All tests pass, though back-compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg calling TermPositions.nextPosition() too many
> times, which the new API asserts against).
>
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package-private APIs on that branch, then fix the nightly build to use
> the tip of that branch?]
>
> There's still plenty to do before this is committable! This is a
> rather large change:
> * Switches to a new, more efficient terms dict format. This still
> uses tii/tis files, but the tii only stores term & long offset (not a
> TermInfo). At seek points, tis encodes term & freq/prox offsets
> absolutely instead of with deltas. Also, tis/tii are structured by
> field, so we don't have to record the field number in every term.
> .
> On the first 1M docs of Wikipedia, the tii file is 36% smaller (0.99
> MB -> 0.64 MB) and the tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading the terms dict index is significantly less,
> since we only load an array of offsets and an array of String (no
> more TermInfo array). It should be faster to init too. (A rough
> sketch of this in-RAM structure appears after this description.)
> .
> This part is basically done.
> * Introduces a modular reader codec that strongly decouples the terms
> dict from the docs/positions readers. EG there is no more TermInfo
> used when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file)
> {code}
> This part is basically done.
> * Introduces a new "flex" API for iterating through fields, terms,
> docs and positions (a usage sketch also follows below):
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/TermDocs/TermPositions. SegmentReader emulates
> the old API on top of the new API to keep back-compat.
>
> Next steps:
> * Plug in new codecs (pulsing, pfor) to exercise the modularity / fix
> any hidden assumptions.
> * Expose the new API from IndexReader, deprecate the old API but
> emulate it on top of the new one, and switch all core/contrib users
> to the new API.
> * Maybe switch to AttributeSource as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility). EG if someone wanted to
> store a payload at the term-doc level instead of the
> term-doc-position level, they could just add a new attribute.
> * Test performance & iterate.
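[Editor's note on the terms-dict index point above: the leaner in-RAM structure the description implies could look roughly like the following, with two parallel arrays replacing the old TermInfo[]. A hedged sketch under assumed names (TermsIndexSketch, seekOffset); this is not the patch's code:]

{code}
// Hypothetical sketch: an array of indexed terms plus a parallel array
// of absolute .tis offsets, instead of a TermInfo[] carrying freq/prox
// pointers for every indexed term.
final class TermsIndexSketch {
  final String[] indexTerms;  // every N'th term, loaded from .tii
  final long[] tisOffsets;    // absolute offset into .tis at each seek point

  TermsIndexSketch(String[] indexTerms, long[] tisOffsets) {
    this.indexTerms = indexTerms;
    this.tisOffsets = tisOffsets;
  }

  // Binary-search for the last indexed term <= the target; the caller
  // then scans .tis from the returned offset, decoding freq/prox
  // pointers only during the scan.
  long seekOffset(String term) {
    int lo = 0, hi = indexTerms.length - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (indexTerms[mid].compareTo(term) <= 0) lo = mid + 1;
      else hi = mid - 1;
    }
    return hi < 0 ? 0 : tisOffsets[hi];
  }
}
{code}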
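[Editor's note: and a hypothetical consumer of the flex chain. Only the four type names FieldProducer, TermsEnum, DocsEnum and PostingsEnum come from the issue; the mini-interfaces and method names below are illustrative assumptions, not the patch's confirmed signatures:]

{code}
import java.io.IOException;

// Assumed mini-interfaces for the four links in the chain.
interface FieldProducer { String nextField() throws IOException; TermsEnum terms() throws IOException; }
interface TermsEnum     { String next() throws IOException; DocsEnum docs() throws IOException; }
interface DocsEnum      { boolean next() throws IOException; int doc(); int freq(); PostingsEnum positions() throws IOException; }
interface PostingsEnum  { int nextPosition() throws IOException; }

class FlexDump {
  // Walk every position of every doc of every term of every field.
  static void dump(FieldProducer fields) throws IOException {
    String field;
    while ((field = fields.nextField()) != null) {
      TermsEnum terms = fields.terms();
      String term;
      while ((term = terms.next()) != null) {
        DocsEnum docs = terms.docs();
        while (docs.next()) {
          PostingsEnum postings = docs.positions();
          for (int i = 0; i < docs.freq(); i++) {
            int pos = postings.nextPosition();
            System.out.println(field + ":" + term + " doc=" + docs.doc() + " pos=" + pos);
          }
        }
      }
    }
  }
}
{code}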