On Fri, Feb 26, 2010 at 12:50:44PM -0500, Michael McCandless wrote:

> * Store additional per-doc stats in the index, eg in a custom
>   posting list,
Inline, as in a payload? Of course that can work, but if the data is common
across multiple postings, you pay in space to gain locality. KinoSearch does
this right now -- inlining the norm byte within each posting -- but I think it
was a dubious tradeoff.

The alternative is to build a shared data structure and point to it using one
or more levels of indirection. For fixed-length data with high entropy, such
as norm bytes, it's hard to improve on indexing by doc id. For variable-width
data, you can either store the file pointer inline in the posting, or use
additional indirection:

    doc_id -> file_pointer -> data

For most cases, I think you'd want to store per-doc data by reference rather
than inline it, no? But why do we have to make a choice one way or the other?
So long as the decoder has the option of reading from arbitrary streams and
data structures, we can leave it up to the posting format spec whether to
inline or not.

> including length in tokens of the field, avg tf, and
> boost (boost can be efficiently stored so only if it differs from
> default is it stored).

Efficient compression always depends on identifying patterns within data.
Classical examples:

  * Audio signals delta-encode well because they are continuously variable
    and adjacent samples are usually close together in magnitude.
  * Line art has a great affinity for run-length encoding because many
    contiguous pixels have exactly the same color value.

Since boost is optional and a lot of people don't use it, it's true that if we
break it out from length normalization, it can be stored efficiently using
something like RLE. However, boost isn't the only thing that may be worth
compressing: single-token and match-only fields have constant
length-in-tokens. Surely the posting format specification must be allowed to
exploit any compression technique, at its own discretion and with insight into
data patterns unavailable to us from the top level.

> Do not compute nor store norms in the index.

Heh. Well, with Lucy, naturally we'd want to do the opposite of that: write
everything at index time and then mmap at search time. Even in Lucene, it
seems odd to want to calculate all of those norms on the fly each time you
open an index. It seems to me that this is a specialized need of BM25.

So how do you support both? Easy: allow the postings reader an initial startup
phase where it can cache arbitrary data, typically data which is shared across
multiple postings. For instance:

  * Either slurp (Lucene) or mmap (Lucy) a pregenerated norms file.
  * Accumulate/derive data, e.g. true normalization data for BM25.
  * Do nothing, because the field is match-only.

You'd want to do this inside some sort of per-field class, which would then
spawn individual enumerators. The shared reader holds common data; the
enumerators reference the common data, but maintain state while iterating over
postings.

    SegmentReader
        PostingListReader
            map of field-PostingsReader pairs   <-- per-field shared data, persistent

Requests for various enumerators (doc id only, doc id + positions, etc.) would
be dispatched through to the PostingsReader for the specified field.
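In Java-flavored pseudocode, the shape might look something like this -- all
of these names are hypothetical, for illustration only, and don't correspond
to existing Lucene or Lucy classes:

    import java.io.IOException;
    import java.util.Map;

    // Hypothetical sketch only -- not actual Lucene or Lucy APIs.

    /** Per-posting iterator; references shared data owned by its PostingsReader. */
    interface PostingsEnumerator {
        int nextDoc() throws IOException;
    }

    /** Per-field shared reader: caches common data once, then spawns enumerators. */
    abstract class PostingsReader {
        /** Startup phase: slurp/mmap a norms file, derive BM25 data, or do nothing. */
        abstract void cacheSharedData() throws IOException;

        abstract PostingsEnumerator docs(String term) throws IOException;
        abstract PostingsEnumerator docsAndPositions(String term) throws IOException;
    }

    /** Holds the per-field PostingsReaders and dispatches enumerator requests. */
    class PostingListReader {
        private final Map<String, PostingsReader> perField;

        PostingListReader(Map<String, PostingsReader> perField) {
            this.perField = perField;
        }

        PostingsEnumerator docs(String field, String term) throws IOException {
            return perField.get(field).docs(term);  // dispatch to the field's reader
        }
    }

The important property is that enumerators stay cheap to create, while the
expensive shared data gets built exactly once per field when the reader opens.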
> Merging would just concatenate these values (removing deleted docs).

The ability to concatenate depends on having either no externally referenced
data, or having all externally referenced data be immutable.

Theoretically, concatenation works fine for delta-encoded position data
because positions start over at 0 with each document. They're wholly
self-contained.

As a counterexample, consider that we can concatenate serialized documents so
long as the associations between field names and numbers stay constant, but we
must decode and re-encode when those associations change.

The merger cannot know in advance whether concatenation is appropriate because
the codec is opaque to it. I think the default should be to merge fully
expanded objects, with the option to concatenate as an optimization left up to
the codec. Otherwise, we end up restricting the kinds of external references
that can be made by a posting.

> * Change IR so on open it generates norms dynamically, ie by walking
>   the stats, computing avgs (eg avg field length in tokens), and
>   computing the final per-field boost, casting to a 1-byte quantized
>   float.

Right, I think this data should be cached by a per-field parent reader object
during an initialization phase.

> We may want to store aggregates in eg SegmentInfo to save
> the extra pass on IR open...

Hmm, that sounds familiar...

https://svn.apache.org/repos/asf/lucene/lucy/trunk/core/Lucy/Index/Segment.bp

    Each Segment object keeps track of information about an index segment:
    its fields, document count, and so on. The Segment object itself writes
    one file, segmeta.json; besides storing info needed by Segment itself,
    the "segmeta" file serves as a central repository for metadata generated
    by other index components -- relieving them of the burden of storing
    metadata themselves.

As far as aggregates go, I think you want to be careful to avoid storing any
kind of data that scales with segment size within a SegmentInfo.

> * Change Similarity, to allow field-specific Similarity (I think we
>   have issue open for this already). I think, also, lengthNorm
>   (which is no longer invoked during indexing) would no longer be
>   used.

Well, as you might suspect, I consider that one a gimme. KinoSearch supports
per-field Similarity now.

The insane loose typing of fields in Lucene is going to make it a little
tricky to implement, though. I think you just have to exclude fields assigned
to specific similarity implementations from your
merge-anything-to-the-lowest-common-denominator policy and throw exceptions
when there are conflicts rather than attempt to resolve them.

> I think we'd make the class that computes norms from these per-doc
> stats on IR open pluggable.

Similarity is where we decode norms right now. In my opinion, it should be the
Similarity object from which we specify per-field posting formats. See my
reply to Robert in the BM25 thread:

http://markmail.org/message/77rmrfmpatxd3p2e

That way, custom scoring implementations can guarantee that they always have
the posting information they need available to make their similarity
judgments. Similarity also becomes a more generalized notion, with the
TF/IDF-specific functionality moving into a subclass.

Similarity implementation and posting format are so closely related that in my
opinion, it makes sense to tie them -- there's a rough sketch below.

> And, someday we could make what stats are gathered/stored during indexing
> pluggable but for starters I think we should simply support the field length
> in tokens and avg tf per field.

I would argue against making this your top priority, because I think adding
half-baked features that require index-format changes is bad policy. If you're
looking for small steps, my suggestion would be to focus on per-field
Similarity support.
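To make that concrete, here's a rough Java-flavored sketch of how per-field
Similarity and the Similarity/posting-format tie might hang together. Again,
every name here is hypothetical -- this isn't the actual Lucene Similarity
API, just an illustration of the shape:

    // Hypothetical sketch only -- not the actual Lucene Similarity API.

    /** Minimal stand-in for whatever per-posting data the codec decodes. */
    class PostingData {
        int freq;
        float norm;
    }

    /** A Similarity names the posting format it needs, and scores from it. */
    abstract class FieldSimilarity {
        abstract String postingFormat();   // e.g. "MatchOnly", "FreqProx", "BM25"
        abstract float score(PostingData data);
    }

    /** TF/IDF-specific behavior lives in a subclass, not the generalized base. */
    class TfIdfSimilarity extends FieldSimilarity {
        @Override String postingFormat() { return "FreqProx"; }
        @Override float score(PostingData data) {
            return (float) Math.sqrt(data.freq) * data.norm;
        }
    }

    /** Per-field assignment: throw on conflict rather than silently merging
        fields down to a lowest common denominator. */
    class FieldSimilarityRegistry {
        private final java.util.Map<String, FieldSimilarity> sims =
            new java.util.HashMap<>();

        void assign(String field, FieldSimilarity sim) {
            FieldSimilarity existing = sims.putIfAbsent(field, sim);
            if (existing != null && existing.getClass() != sim.getClass()) {
                throw new IllegalArgumentException(
                    "Conflicting Similarity for field: " + field);
            }
        }
    }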
Marvin Humphrey