On Fri, Feb 26, 2010 at 12:50:44PM -0500, Michael McCandless wrote:

>   * Store additional per-doc stats in the index, eg in a custom
>     posting list, 

Inline, as in a payload?  Of course that can work, but if the data is common
over multiple postings, you pay in space to gain locality.  KinoSearch does
this right now -- inlining the norm byte within each posting -- but I think it
was a dubious tradeoff.

The alternative is to build a shared data structure and point to it using one
or more levels of indirection.  For fixed-length data with high entropy, such
as norm bytes, it's hard to improve on indexing by doc id.  For variable-width
data, you can either store the file pointer inline in the posting, or use
additional indirection:

    doc_id -> file_pointer -> data

For most cases, I think you'd want to store per-doc data by reference rather
than inline it, no?
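
To make the by-reference option concrete, here's a minimal Java sketch of the
doc_id -> file_pointer -> data indirection.  The class and file layout are
hypothetical, just to illustrate the two lookups:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    class PerDocDataReader {
        private final long[] filePointers;       // doc_id -> file_pointer
        private final RandomAccessFile dataFile; // file_pointer -> data

        PerDocDataReader(long[] filePointers, RandomAccessFile dataFile) {
            this.filePointers = filePointers;
            this.dataFile = dataFile;
        }

        /** Resolve the variable-width blob for a document by reference. */
        byte[] read(int docId) throws IOException {
            long start = filePointers[docId];
            long end   = (docId + 1 < filePointers.length)
                       ? filePointers[docId + 1]
                       : dataFile.length();
            byte[] blob = new byte[(int) (end - start)];
            dataFile.seek(start);
            dataFile.readFully(blob);
            return blob;
        }
    }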

But why do we have to make a choice one way or the other?  So long as the
decoder has the option of reading from arbitrary streams and data structures,
we can leave it up to the posting format spec whether to inline or not.

>     including length in tokens of the field, avg tf, and
>     boost (boost can be efficiently stored so only if it differs from
>     default is it stored).  

Efficient compression always depends on identifying patterns within data.
Classical examples:

  * Audio signals delta encode well because they are continuously variable and
    adjacent samples are usually close together in magnitude.
  * Line art has a great affinity for run-length encoding because many
    contiguous pixels have exactly the same color value.
  
Since boost is optional and a lot of people don't use it, it's true that if we
break it out from length normalization, it can be stored efficiently using
something like RLE.  However, boost isn't the only thing that may be worth
compressing: single-token and match-only fields have constant
length-in-tokens.
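
As a rough illustration of the RLE idea (hypothetical code, not an existing
Lucene or Lucy API), a per-doc boost stream could collapse into
(run length, value) pairs, so long stretches of the default boost cost a
single entry:

    import java.util.ArrayList;
    import java.util.List;

    class BoostRLE {
        static final float DEFAULT_BOOST = 1.0f;  // most docs carry this value

        /** A run of identical boost values. */
        static final class Run {
            final int length;
            final float value;
            Run(int length, float value) {
                this.length = length;
                this.value = value;
            }
        }

        /** Collapse per-doc boosts into (run length, value) pairs. */
        static List<Run> encode(float[] boosts) {
            List<Run> runs = new ArrayList<>();
            int i = 0;
            while (i < boosts.length) {
                int runStart = i;
                while (i < boosts.length && boosts[i] == boosts[runStart]) {
                    i++;
                }
                runs.add(new Run(i - runStart, boosts[runStart]));
            }
            return runs;
        }
    }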

Surely the posting format specification must be allowed to exploit any
compression technique, at its own discretion and with insight into data
patterns unavailable to us from the top level.

>     Do not compute nor store norms in the index.  

Heh.  Well, with Lucy, naturally we'd want to do the opposite of that: write
everything at index time and then mmap at search time.  Even in Lucene, it
seems odd to want to calculate all of those on the fly each time you open an
index.  It seems to me that this is a specialized need of BM25.

So how do you support both?  Easy: allow the postings reader an initial startup
phase where it can cache arbitrary data, typically data which is shared across
multiple postings. For instance:

  * Either slurp (Lucene) or mmap (Lucy) a pregenerated norms file.
  * Accumulate/derive data, e.g. true normalization data for BM25.
  * Do nothing, because the field is match-only.

You'd want to do this inside some sort of per-field class, which would then
spawn individual enumerators.  The shared reader holds common data; the
enumerators reference the common data, but maintain state while iterating over
postings.

 SegmentReader
   PostingListReader
     map of field-PostingsReader pairs <-- per-field shared data, persistent

Requests for various enumerators (doc id only, doc id + positions, etc) would
be dispatched through to the PostingsReader for the specified field.
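
A rough Java sketch of that arrangement, with hypothetical class names rather
than existing Lucene or Lucy APIs:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    abstract class PostingsReader {                  // per-field, persistent
        /** Startup phase: slurp/mmap norms, derive BM25 stats, or do nothing. */
        abstract void cacheSharedData() throws IOException;

        /** Spawn an enumerator that references the shared data. */
        abstract PostingsEnumerator enumerator();
    }

    abstract class PostingsEnumerator {              // per-query, stateful
        abstract boolean next();                     // advance to the next posting
        abstract int docId();
    }

    class PostingListReader {
        private final Map<String, PostingsReader> perField = new HashMap<>();

        void register(String field, PostingsReader reader) throws IOException {
            reader.cacheSharedData();                // one-time init per segment
            perField.put(field, reader);
        }

        /** Dispatch enumerator requests to the reader for the given field. */
        PostingsEnumerator enumerator(String field) {
            return perField.get(field).enumerator();
        }
    }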

>     Merging would just concatenate these values (removing deleted docs).

The ability to concatenate depends on having either no externally referenced
data, or having all externally referenced data be immutable.  

Theoretically, concatenation works fine for delta-encoded position data because
positions start over at 0 with each document.  They're wholly self-contained.

As a counterexample, consider that we can concatenate serialized documents so
long as the associations between field names and numbers stay constant, but we
must decode and re-encode when those associations change.

The merger cannot know in advance whether concatenation is appropriate because
the codec is opaque to it.  I think the default should be to merge fully
expanded objects, with the option to concatenate as an optimization left up to
the codec.  Otherwise, we end up restricting the kind of external references
that can be made by a posting.
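
A sketch of that default, with hypothetical names: the merger asks the codec
whether raw concatenation is safe, and otherwise falls back to decoding and
re-encoding fully expanded objects:

    interface PostingCodec {
        /** True only if the segment's postings are fully self-contained
         *  (or all external references are immutable). */
        boolean supportsRawConcatenation();

        void copyRawPostings(SegmentPostings src, SegmentPostings dest);   // fast path
        void reencodePostings(SegmentPostings src, SegmentPostings dest);  // default path
    }

    class SegmentPostings { /* placeholder for one segment's posting data */ }

    class PostingMerger {
        void merge(PostingCodec codec, SegmentPostings[] segments, SegmentPostings merged) {
            // Concatenation is an optimization the codec opts into;
            // the merger itself cannot tell whether it is appropriate.
            boolean canConcatenate = codec.supportsRawConcatenation();
            for (SegmentPostings segment : segments) {
                if (canConcatenate) {
                    codec.copyRawPostings(segment, merged);
                } else {
                    codec.reencodePostings(segment, merged);
                }
            }
        }
    }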

>   * Change IR so on open it generates norms dynamically, ie by walking
>     the stats, computing avgs (eg avg field length in tokens), and
>     computing the final per-field boost, casting to a 1-byte quantized
>     float.  

Right, I think this data should be cached by a per-field parent reader object
during an initialization phase.
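
For illustration, a minimal sketch of such an init phase: walk the stored
per-doc field lengths once at open, derive the average, and quantize a per-doc
value down to one byte.  The linear quantization here is purely illustrative,
not Lucene's actual norm encoding:

    class FieldNormsCache {
        final float avgFieldLength;   // aggregate derived by walking the stats
        private final byte[] norms;

        FieldNormsCache(int[] fieldLengths) {
            long total = 0;
            for (int len : fieldLengths) {
                total += len;
            }
            avgFieldLength = (float) total / fieldLengths.length;

            norms = new byte[fieldLengths.length];
            for (int doc = 0; doc < fieldLengths.length; doc++) {
                float norm = (float) (1.0 / Math.sqrt(Math.max(1, fieldLengths[doc])));
                norms[doc] = quantize(norm);
            }
        }

        /** Crude linear 1-byte quantization of a float in (0, 1]. */
        private static byte quantize(float f) {
            return (byte) Math.min(255, Math.round(f * 255.0f));
        }

        byte norm(int doc) {
            return norms[doc];
        }
    }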

>     We may want to store aggregates in eg SegmentInfo to save
>     the extra pass on IR open...

Hmm, that sounds familiar...

   https://svn.apache.org/repos/asf/lucene/lucy/trunk/core/Lucy/Index/Segment.bp

   Each Segment object keeps track of information about an index segment: 
   its fields, document count, and so on.  The Segment object itself writes 
   one file, <code>segmeta.json</code>; besides storing info needed by 
   Segment itself, the "segmeta" file serves as a central repository for 
   metadata generated by other index components -- relieving them of the 
   burden of storing metadata themselves.

As far as aggregates go, I think you want to be careful to avoid storing any
kind of data that scales with segment size within a SegmentInfo.
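
To illustrate the pattern (hypothetical code, not the actual segmeta.json
contents): components should hand the central segment metadata small,
fixed-size aggregates, never per-document arrays:

    import java.util.HashMap;
    import java.util.Map;

    class SegmentMetadataExample {
        static Map<String, Object> buildMetadata(float avgLengthInTokens) {
            Map<String, Object> segMetadata = new HashMap<>();
            // Fine: a fixed-size aggregate, independent of segment size.
            segMetadata.put("postings.body.avg_length_in_tokens", avgLengthInTokens);
            // Avoid: per-document arrays here would scale with the doc count
            // and belong in their own data files instead.
            return segMetadata;
        }
    }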

>   * Change Similarity, to allow field-specific Similarity (I think we
>     have issue open for this already).  I think, also, lengthNorm
>     (which is no longer invoked during indexing) would no longer be
>     used.

Well, as you might suspect, I consider that one a gimme.  KinoSearch supports
per-field Similarity now.

The insane loose typing of fields in Lucene is going to make it a little
tricky to implement, though.  I think you just have to exclude fields assigned
to specific Similarity implementations from your
merge-anything-to-the-lowest-common-denominator policy, and throw exceptions
when there are conflicts rather than attempt to resolve them.
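
Something like the following hypothetical registry sketches that policy:
explicit per-field assignments are honored, and conflicts throw rather than
get reconciled:

    import java.util.HashMap;
    import java.util.Map;

    class PerFieldSimilarityRegistry {
        private final Map<String, String> simByField = new HashMap<>();

        void assign(String field, String similarityName) {
            String existing = simByField.putIfAbsent(field, similarityName);
            if (existing != null && !existing.equals(similarityName)) {
                // No lowest-common-denominator fallback: conflicting
                // explicit assignments are an error.
                throw new IllegalArgumentException(
                    "Field '" + field + "' already bound to " + existing
                    + ", cannot rebind to " + similarityName);
            }
        }

        String similarityFor(String field, String defaultSim) {
            return simByField.getOrDefault(field, defaultSim);
        }
    }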

> I think we'd make the class that computes norms from these per-doc
> stats on IR open pluggable.  

Similarity is where we decode norms right now.  In my opinion, the Similarity
object is where we should specify per-field posting formats.

See my reply to Robert in the BM25 thread:

    http://markmail.org/message/77rmrfmpatxd3p2e

    That way, custom scoring implementations can guarantee that they always
    have the posting information they need available to make their similarity
    judgments. Similarity also becomes a more generalized notion, with the
    TF/IDF-specific functionality moving into a subclass. 

Similarity implementation and posting format are so closely related that in my
opinion, it makes sense to tie them. 
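
A minimal sketch of that coupling, with hypothetical names: the Similarity
declares the posting format it needs, and the TF/IDF-specific functionality
lives in a subclass:

    abstract class Similarity {
        /** Each Similarity guarantees the posting data it needs is available. */
        abstract String postingFormat();
    }

    class TFIDFSimilarity extends Similarity {
        @Override
        String postingFormat() {
            return "frequency+positions";   // needs tf and norms
        }

        float tf(int freq) {
            return (float) Math.sqrt(freq);
        }

        float idf(int docFreq, int numDocs) {
            return (float) (Math.log((double) numDocs / (docFreq + 1)) + 1.0);
        }
    }

    class MatchOnlySimilarity extends Similarity {
        @Override
        String postingFormat() {
            return "doc-ids-only";          // no tf, no norms
        }
    }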

> And, someday we could make what stats are gathered/stored during indexing
> pluggable but for starters I think we should simply support the field length
> in tokens and avg tf per field.

I would argue against making this your top priority, because I think adding
half-baked features that require index-format changes is bad policy.

If you're looking for small steps, my suggestion would be to focus on
per-field Similarity support.

Marvin Humphrey

