[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748765#action_12748765 ]

Michael McCandless commented on LUCENE-1458:
--------------------------------------------

bq. Maybe we should break this whole issue into smaller pieces? We could start 
with the dictionary. The changes you made here are really cool already.

Yeah the issue is very large now.  I'll think about how to break it
up.

I agree: the new default terms dict codec is a good step forward.
Rather than loading a separate TermInfo instance for every indexed
term (costly in object overhead, and, because we also store a Term[],
wasteful in space since many duplicate String field pointers are
stored in a row), we only store the term text (String) and the long
offset into the index file, as two parallel arrays.  It's a sizable
memory savings for indexes with many terms.

This was a nice side effect of genericizing things: the TermInfo class
had to be made private to the codec, since it stores things like
proxOffset, freqOffset, etc., which are particular to how Lucene's
default codec stores postings.

But it's somewhat tricky to break out only this change... eg it's also
coupled with the change to strongly separate the field from the term
text, and to remove the reliance on TermInfo.  Ie, the new terms dict
has a separate per-field class, and within that per-field class it
holds the String[] term texts and long[] index offsets.  I guess we
could make a drop-in class that tries to emulate
TermInfosReader/SegmentTermEnum even though, internally, it's
separated per field.
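
Roughly, that per-field class boils down to something like this
(hypothetical names, just to sketch the memory layout -- not the
actual classes in the patch):

{code}
// Sketch only: per-term cost is one String reference plus one long,
// instead of a Term object + TermInfo object per indexed term.
class FieldTermsIndex {
  final String field;        // stored once for the whole field
  final String[] termText;   // index terms for this field, in sorted order
  final long[] indexOffset;  // offset into the terms dict file for each index term

  FieldTermsIndex(String field, String[] termText, long[] indexOffset) {
    this.field = field;
    this.termText = termText;
    this.indexOffset = indexOffset;
  }
}
{code}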

bq. We could further separate the actual TermsDictReader from the terms index 
with a clean API (I think you put actually a TODO comment into your patch).

Actually the whole terms dict writing/reading is itself pluggable, so
your codec could provide its own.  Ie, Lucene "just" needs a
FieldsConsumer (for writing) and a FieldsProducer (for reading).
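
Roughly, the plug point looks like this (signatures and parameter
types are simplified/hypothetical, just to show the shape;
FieldsConsumer and FieldsProducer are the names mentioned above):

{code}
// Sketch only -- the parameter type names are hypothetical, not the
// exact API in the patch.
abstract class MyCodec {
  // writing side: invoked at flush/merge to write fields, terms and postings
  public abstract FieldsConsumer fieldsConsumer(SegmentWriteState writeState);

  // reading side: invoked when a segment is opened, to read it all back
  public abstract FieldsProducer fieldsProducer(SegmentReadState readState);
}
{code}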

But it sounds like you're proposing a strong decoupling of the terms
index from the terms dict?

bq. Then we can have different terms index implementations in the future, e.g. 
one that uses a tree.

+1

Or, an FST.  An FST is more compelling than a tree since it also
compresses suffixes.  An FST is simply a tree in the front plus a tree
in the back (reversed), where the "output" (a given term's details)
appears in the middle, on an edge that is "unique" to each term, as
you traverse the graph.
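
As a toy illustration of the extra suffix sharing (standalone Java,
nothing from the patch):

{code}
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy demo: a plain prefix tree shares only leading chars, while an FST
// also shares trailing ones ("ation", "tion", ... are stored once).
public class SuffixSharingDemo {
  public static void main(String[] args) {
    List<String> terms = Arrays.asList("nation", "station", "vacation");
    Set<String> prefixes = new HashSet<String>();  // arcs the "front" tree needs
    Set<String> suffixes = new HashSet<String>();  // arcs the reversed "back" tree needs
    for (String t : terms) {
      for (int i = 1; i <= t.length(); i++) {
        prefixes.add(t.substring(0, i));
        suffixes.add(t.substring(t.length() - i));
      }
    }
    // prints 21 distinct prefixes vs 11 distinct suffixes for these 3 terms
    System.out.println("distinct prefixes: " + prefixes.size());
    System.out.println("distinct suffixes: " + suffixes.size());
  }
}
{code}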

bq. We could also make SegmentReader a bit cleaner: if opened just for merging 
it would not create a terms index reader at all; only if cloned for an external 
reader we would instantiate the terms index lazily. Currently this is done by 
setting the divisor to -1.

Right.  Somehow we should genericize the "I don't need the terms index
at all" case when opening a SegmentReader.  Passing -1 is sort of
hackish.  Though I do prefer stating your intentions up front, rather
than loading lazily (LUCENE-1609).

We could, eg, pass "requirements" when asking the codec for the terms
dict reader.  EG if I don't state that RANDOM_ACCESS is required (and
only say LINEAR_SCAN), then the codec can internally make itself more
efficient based on that.
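
Something like this, maybe (purely hypothetical, nothing like it is in
the patch yet):

{code}
// Hypothetical sketch: the caller states up front how it will consume
// the terms dict, so a codec can skip loading the in-memory terms
// index for a merge-only reader.
enum TermsDictAccess {
  LINEAR_SCAN,   // eg a SegmentReader opened only for merging
  RANDOM_ACCESS  // normal searching: the terms index must be loaded
}

// eg the codec's factory method could then take the hint:
//   FieldsProducer fieldsProducer(SegmentReadState state, TermsDictAccess access)
{code}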

bq. In the current patch the choice of the Codec is index-wide, right? So I 
can't specify different codecs for different fields. Please correct me if I'm 
wrong.

The Codec is indeed index-wide; however, because the field and the
term text are strongly separated, it's completely within a Codec's
control to return a different reader/writer for different fields.  So
it ought to work fine... eg in theory one could make a
"PerFieldCodecWrapper".  But I haven't yet tried this with any codecs.
It would make a good test case though... I'll make a note to add a
test case for this.
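
Roughly, such a wrapper would just dispatch on field name (hypothetical
sketch, not in the patch):

{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: delegate to a different Codec per field,
// falling back to a default codec for everything else.
class PerFieldCodecWrapper {
  private final Map<String, Codec> perField = new HashMap<String, Codec>();
  private final Codec defaultCodec;

  PerFieldCodecWrapper(Codec defaultCodec) {
    this.defaultCodec = defaultCodec;
  }

  void add(String field, Codec codec) {
    perField.put(field, codec);
  }

  Codec codecForField(String field) {
    Codec codec = perField.get(field);
    return codec != null ? codec : defaultCodec;
  }
}
{code}

Its FieldsConsumer/FieldsProducer would then route each field's terms
and postings to whatever codecForField returns.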

Also, it's fine if an index has used different codecs for writing over
time, as long as, when reading, you provide a PostingsCodecs instance
that's able to [correctly] resolve those codecs to read those
segments.
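
Ie, something along these lines (hypothetical sketch of what
PostingsCodecs would need to do):

{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: each segment records the name of the codec that
// wrote it; at read time we resolve that name back to a Codec instance.
class PostingsCodecs {
  private final Map<String, Codec> byName = new HashMap<String, Codec>();

  void register(String name, Codec codec) {
    byName.put(name, codec);
  }

  Codec lookup(String name) {
    Codec codec = byName.get(name);
    if (codec == null) {
      throw new IllegalArgumentException("no codec registered under \"" + name + "\"");
    }
    return codec;
  }
}
{code}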



> Further steps towards flexible indexing
> ---------------------------------------
>
>                 Key: LUCENE-1458
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.9
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs, plus certain bugs in tests that
> happened to work before (eg calling TermPositions.nextPosition() too
> many times, which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package-private APIs on that branch, then fix the nightly build to use
> the tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
>     uses tii/tis files, but the tii only stores term & long offset
>     (not a TermInfo).  At seek points, tis encodes term & freq/prox
>     offsets absolutely instead of as deltas.  Also, tis/tii
>     are structured by field, so we don't have to record field number
>     in every term.
> .
>     On the first 1M docs of Wikipedia, the tii file is 36% smaller
>     (0.99 MB -> 0.64 MB) and the tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
>     RAM usage when loading the terms dict index is significantly less
>     since we only load an array of offsets and an array of String (no
>     more TermInfo array).  It should be faster to init too.
> .
>     This part is basically done.
>   * Introduces a modular reader codec that strongly decouples the terms dict
>     from docs/positions readers.  EG there is no more TermInfo used
>     when reading the new format.
> .
>     There's nice symmetry now between reading & writing in the codec
>     chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
>     This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
>     terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
>     This replaces TermEnum/TermDocs/TermPositions.  SegmentReader emulates the
>     old API on top of the new API to keep back-compat.
>     
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>     fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
>     old API on top of new one, switch all core/contrib users to the
>     new API.
>   * Maybe switch to AttributeSource as the base class for TermsEnum,
>     DocsEnum, PostingsEnum -- this would give readers API flexibility
>     (not just index-file-format flexibility).  EG if someone wanted
>     to store a payload at the term-doc level instead of the
>     term-doc-position level, they could just add a new attribute.
>   * Test performance & iterate.
