[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650959#action_12650959 ]

Michael McCandless commented on LUCENE-1458:
--------------------------------------------

{quote}
OK, assume that you slurp all three files. Here's the code from above, ported
from C to Java.
{quote}
Looks good!

{quote}
You can also use FileChannels to memory map this stuff, right? (Have to be
careful on 32-bit systems, though.)
{quote}
Yes.
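
For instance, a minimal sketch of mapping one file via FileChannel (the file name is just illustrative); the 32-bit caveat is that a single mapping can cover at most ~2 GB, so bigger files have to be mapped in chunks:

{code}
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MapTermsDict {
  public static void main(String[] args) throws IOException {
    // "_0.tis" is just an illustrative file name
    try (FileChannel ch = FileChannel.open(Paths.get("_0.tis"),
                                           StandardOpenOption.READ)) {
      // A single MappedByteBuffer can address at most Integer.MAX_VALUE
      // bytes, and on a 32-bit JVM the process address space itself is
      // the limit, so large files must be mapped in chunks.
      MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
      System.out.println("first byte: " + buf.get(0));
    }
  }
}
{code}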

{quote}
First, a lot of tree algorithms are optimized to a greater or lesser extent
for insertion speed, but we hardly care about that at all. We can spend all
the cycles we need at index-time balancing nodes within a segment, and once
the tree is written out, it will never be updated.
{quote}

Right, neither inserts nor deletes matter to us.

{quote}
Second, when we are writing out the term dictionary at index-time, the raw
data will be fed into the writer in sorted order as iterated values, one
term/term-info pair at a time. Ideally, the writer would be able to serialize
the tree structure during this single pass, but it could also write a
temporary file during the terms iteration then write a final file afterwards.
The main limitation is that the writer will never be able to "see" all
terms at once as an array.
{quote}

Lucene differs from Lucy/KS here.  For Lucene, when flushing a
new segment, we can assume all Terms are visible in RAM at once.  We
don't make use of this today (it's a simple iteration that's handed to
the consumer), but we could.  In Lucene, when RAM is full, we flush a
real segment (whereas KS flushes a "run", which I think is more of a
raw dump, ie you don't build lexicon trees during that?).

However, for both Lucene and Lucy/KS, during merging one cannot assume
the entire lexicon fits in RAM at once.  But then, during merging you
could in theory merge the trees themselves rather than expanded terms.

I think for starters, at least, we should stick with the simple
shared-prefix compression we have today.
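
To make "shared-prefix compression" concrete, here's a toy sketch of
front-coding a sorted term list; this is not Lucene's actual tis format
(which uses VInts and stores per-term info too), just the idea:

{code}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class PrefixCodedTerms {

  static int commonPrefixLen(String a, String b) {
    int n = Math.min(a.length(), b.length()), i = 0;
    while (i < n && a.charAt(i) == b.charAt(i)) i++;
    return i;
  }

  /** Each entry stores only the length of the prefix shared with the
   *  previous term, plus the new suffix bytes. */
  public static byte[] encode(String[] sortedTerms) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    String prev = "";
    for (String term : sortedTerms) {
      int prefix = commonPrefixLen(prev, term);
      byte[] suffix = term.substring(prefix).getBytes(StandardCharsets.UTF_8);
      out.writeInt(prefix);        // chars shared with the previous term
      out.writeInt(suffix.length); // length of the new suffix in bytes
      out.write(suffix);
      prev = term;
    }
    return bytes.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    String[] terms = {"apple", "applet", "apply", "banana"};
    System.out.println(encode(terms).length + " bytes");
  }
}
{code}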

{quote}
Third, at read-time we're going to have one of these trees per segment. We'd
really like to be able to conflate them somehow. KinoSearch actually
implements a MultiLexicon class which keeps SegLexicons in a PriorityQueue;
MultiLexicon_Next() advances the queue to the next unique term. However,
that's slow, unwieldy, and inflexible. Can we do better?
{quote}

Continuing the move towards pushing searching closer to the segments
(ie, using MultiSearcher instead of MultiReader), I think we should
not try to conflate the terms dict?
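
For reference, the PriorityQueue merge that MultiLexicon does could
look roughly like this sketch (names are illustrative, not the KS
API): each per-segment iterator sits in the queue keyed by its current
term, and duplicates across segments are skipped:

{code}
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class MergedTerms {
  private static class Entry implements Comparable<Entry> {
    String term;
    final Iterator<String> it;
    Entry(Iterator<String> it) { this.it = it; this.term = it.next(); }
    @Override public int compareTo(Entry o) { return term.compareTo(o.term); }
  }

  /** Prints each unique term once, in sorted order across all segments. */
  public static void printUniqueTerms(List<Iterator<String>> segments) {
    PriorityQueue<Entry> pq = new PriorityQueue<>();
    for (Iterator<String> it : segments) {
      if (it.hasNext()) pq.add(new Entry(it));
    }
    String last = null;
    while (!pq.isEmpty()) {
      Entry top = pq.poll();
      if (!top.term.equals(last)) {   // skip duplicates across segments
        System.out.println(top.term);
        last = top.term;
      }
      if (top.it.hasNext()) {         // re-insert keyed on its next term
        top.term = top.it.next();
        pq.add(top);
      }
    }
  }
}
{code}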

{quote}
It would be ideal if we could separate the keys from the values and put all
the keys in a single file.
{quote}

Why not inline the value with the key?  The pointer to the value just
consumes extra space.  I think "value" in this context is the long
offset into the main terms dict file, which then stores the "real
[opaque] value" for each term.
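
Ie, something like this sketch (names hypothetical): with the value
inlined, each index entry carries its long offset into the main terms
dict directly, and the extra dereference disappears:

{code}
// Separate key/value files: each key stores a pointer to its value.
//   index keys:   [term bytes][pointer into values file]
//   index values: [long offset into main terms dict]
//
// Inlined, the pointer disappears; an entry is just:
final class TermIndexEntry {
  final String term;     // the key
  final long dictOffset; // inlined value: seek point in the main terms dict
  TermIndexEntry(String term, long dictOffset) {
    this.term = term;
    this.dictOffset = dictOffset;
  }
}
{code}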

{quote}
> Though, if we want to do neat things like respelling, wildcard/prefix
> searching, etc., which reduce to graph-intersection problems, we would
> need the suffix and we would need the entire lexicon (not just every
> 128th index term) compiled into the FST.

The main purpose of breaking out a separate index structure is to avoid binary
searching over the large primary file. There's nothing special about the
extra file - in fact, it's a drawback that it doesn't include all terms. If
we can jam all the data we need to binary search against into the front of the
file, but include the data for all terms in an infrequently-accessed tail, we
win.
{quote}
And, if your terms index is in RAM, another purpose is to minimize
its net size and decode cost on loading.
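
Eg, a lookup against such an in-RAM index (holding every 128th term)
could look like this sketch (class and field names hypothetical):
binary search the small index, then scan forward in the main terms
file:

{code}
import java.util.Arrays;

public class TermsIndex {
  private final String[] indexTerms; // every 128th term, sorted
  private final long[] dictOffsets;  // file position of each index term

  TermsIndex(String[] indexTerms, long[] dictOffsets) {
    this.indexTerms = indexTerms;
    this.dictOffsets = dictOffsets;
  }

  /** File offset at which a sequential scan for the term starts; at
   *  most indexInterval-1 entries are read past it. */
  long seekOffset(String term) {
    int pos = Arrays.binarySearch(indexTerms, term);
    if (pos < 0) pos = -pos - 2; // floor: last index term <= term
    return pos < 0 ? 0 : dictOffsets[pos];
  }
}
{code}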


> Further steps towards flexible indexing
> ---------------------------------------
>
>                 Key: LUCENE-1458
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.9
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg calling TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
>     uses tii/tis files, but the tii only stores term & long offset
>     (not a TermInfo).  At seek points, tis encodes term & freq/prox
>     offsets absolutely instead of with deltas.  Also, tis/tii
>     are structured by field, so we don't have to record field number
>     in every term.
> .
>     On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
>     -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
>     RAM usage when loading terms dict index is significantly less
>     since we only load an array of offsets and an array of String (no
>     more TermInfo array).  It should be faster to init too.
> .
>     This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
>     from docs/positions readers.  EG there is no more TermInfo used
>     when reading the new format.
> .
>     There's nice symmetry now between reading & writing in the codec
>     chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
>     This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
>     terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
>     This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
>     old API on top of the new API to keep back-compat.
>     
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>     fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
>     old API on top of new one, switch all core/contrib users to the
>     new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
>     DocsEnum, PostingsEnum -- this would give readers API flexibility
>     (not just index-file-format flexibility).  EG if someone wanted
>     to store payload at the term-doc level instead of
>     term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.
