On Thu, Mar 25, 2010 at 06:24:34AM -0400, Michael McCandless wrote:
> >> Also, will Lucy store the original stats?
> >
> > These?
> >
> >   * Total number of tokens in the field.
> >   * Number of unique terms in the field.
> >   * Doc boost.
> >   * Field boost.
> 
> Also sum(tf).  Robert can generate more :)

Hmm, aren't "Total number of tokens in the field" and sum(tf) normally
equivalent?  I guess there might be analyzers for which that isn't true, e.g.
those which perform synonym-injection?

In any case, "sum(tf)" is probably a better definition, because it makes no
ancillary claims...

> > Incidentally, what are you planning to do about field boost if it's not
> > always 1.0?  Are you going to store full 32-bit floats?
> 
> For starters, yes.  

OK, how are those going to be encoded?  IEEE 754?  Big-endian?

    http://en.wikipedia.org/wiki/Endianness#Floating-point_and_endianness
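
For what it's worth, a byte-order-independent encoder is only a few lines.
This is just a sketch, assuming the host's float format is already IEEE 754,
so that only byte order needs normalizing:

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    /* Serialize a 32-bit float as big-endian IEEE 754 regardless of host
     * byte order.  Assumes the host float format is IEEE 754. */
    static void
    encode_f32_bigendian(float value, uint8_t *buf) {
        uint32_t bits;
        memcpy(&bits, &value, sizeof(bits)); /* reinterpret without aliasing UB */
        buf[0] = (uint8_t)(bits >> 24);
        buf[1] = (uint8_t)(bits >> 16);
        buf[2] = (uint8_t)(bits >> 8);
        buf[3] = (uint8_t)(bits);
    }

    int main(void) {
        uint8_t buf[4];
        encode_f32_bigendian(1.5f, buf);
        printf("%02x %02x %02x %02x\n", buf[0], buf[1], buf[2], buf[3]);
        /* prints "3f c0 00 00" */
        return 0;
    }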

> We may (later) want to make a new attr that sets
> the #bits (levels/precision) you want... then uses packed ints to
> encode.

I'm concerned that the bit-wise entropy of floats may make them a poor match
for compression via packed ints.  We'll probably get a compressed
representation which is larger than the original.
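
To make the concern concrete, here's a small standalone illustration (the
boost values are made up): reinterpret typical boosts as raw 32-bit ints and
they all land in the neighborhood of 2^30, so a packed-int encoding of the
raw bits would still need 30-odd bits per value:

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    int main(void) {
        /* Made-up but plausible boost values. */
        float boosts[] = { 1.0f, 1.1f, 1.5f, 2.0f, 0.8f };
        int i;
        for (i = 0; i < 5; i++) {
            uint32_t bits;
            memcpy(&bits, &boosts[i], sizeof(bits));
            /* e.g. 1.0f -> 0x3f800000, 2.0f -> 0x40000000 */
            printf("%.2f -> 0x%08x\n", boosts[i], bits);
        }
        return 0;
    }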

Are there any standard algorithms out there for compressing IEEE 754 floats?
RLE works, but only with certain data patterns.

... [ time passes ] ...

Hmm, maybe not:

    http://stackoverflow.com/questions/2238754/compression-algorithm-for-ieee-754-data

> I was specifically asking if Lucy will allow the user to force true
> average to be recomputed, ie, at commit time from the writer. 

That's theoretically possible.  We'd have to implement the reader the same way
we implemented DeletionsReader -- the most recent segment may contain data
which applies to older segments.

Here's the DeletionsReader code, which searches backwards through the segments
looking for a particular file:

    /* Start with deletions files in the most recently added segments and work
     * backwards.  The first one we find which addresses our segment is the
     * one we need. */
    for (i = VA_Get_Size(segments) - 1; i >= 0; i--) {
        Segment *other_seg = (Segment*)VA_Fetch(segments, i);
        Hash *metadata 
            = (Hash*)Seg_Fetch_Metadata_Str(other_seg, "deletions", 9);
        if (metadata) {
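            /* CERTIFY downcasts the Obj and throws an exception unless it
             * actually belongs to the specified class. */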
            Hash *files = (Hash*)CERTIFY(
                Hash_Fetch_Str(metadata, "files", 5), HASH);
            Hash *seg_files_data 
                = (Hash*)Hash_Fetch(files, (Obj*)my_seg_name);
            if (seg_files_data) {
                Obj *count = (Obj*)CERTIFY(
                    Hash_Fetch_Str(seg_files_data, "count", 5), OBJ);
                del_count = (i32_t)Obj_To_I64(count);
                del_file  = (CharBuf*)CERTIFY(
                    Hash_Fetch_Str(seg_files_data, "filename", 8), CHARBUF);
                break;
            }
        }
    }

What we'd do is write the regenerated boost bytes for *all* segments to the
most recent segment.  It would be roughly analogous to building up an NRT
reader.
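
A reader for regenerated boosts could then mirror the DeletionsReader loop
above.  Purely as a sketch -- the "boosts" metadata key, the "entry" hash,
and the boost_file variable below are hypothetical, not an existing Lucy
format:

    /* Hypothetical: search backwards for the newest segment that carries
     * regenerated boost data addressing my_seg_name, just as DeletionsReader
     * does for deletions files. */
    for (i = VA_Get_Size(segments) - 1; i >= 0; i--) {
        Segment *other_seg = (Segment*)VA_Fetch(segments, i);
        Hash *metadata
            = (Hash*)Seg_Fetch_Metadata_Str(other_seg, "boosts", 6);
        if (metadata) {
            Hash *files = (Hash*)CERTIFY(
                Hash_Fetch_Str(metadata, "files", 5), HASH);
            Hash *entry = (Hash*)Hash_Fetch(files, (Obj*)my_seg_name);
            if (entry) {
                boost_file = (CharBuf*)CERTIFY(
                    Hash_Fetch_Str(entry, "filename", 8), CHARBUF);
                break;
            }
        }
    }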

> > What's trickier is that Schemas are not normally mutable, and that they
> > are part of the index.  You don't have to supply an Analyzer, or a
> > Similarity, or anything else when opening a Searcher -- you just provide
> > the location of the index, and the Schema gets deserialized from the
> > latest schema_NNN.json file.  That has many advantages, e.g. inadvertent
> > Analyzer conflicts are pretty much a thing of the past for us.
> 
> That's nice... though... is it too rigid?  Do users even want to pick
> a different analyzer at search time?

It's not common.  

To my mind, the way a field is tokenized is part of its field definition; thus
the Analyzer is part of the field definition, and thus the Analyzer is part of
the Schema and needs to be stored with the index.

Still, we support different Analyzers at search time by way of QueryParser.
QueryParser's constructor requires a Schema, but it also accepts an optional
Analyzer which, if supplied, is used in place of the Analyzers from the
Schema.
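
Conceptually, the dispatch inside QueryParser amounts to the following.  (The
function and type names here are illustrative, not the literal Lucy API.)

    /* Illustrative sketch only.  If the caller handed an Analyzer to the
     * QueryParser constructor, it wins; otherwise the field's Analyzer comes
     * from the Schema that was stored with the index. */
    static Analyzer*
    S_analyzer_for_field(Schema *schema, const CharBuf *field,
                         Analyzer *override) {
        return override
             ? override
             : Schema_Fetch_Analyzer(schema, field);
    }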

> > Maybe aggressive automatic data-reduction makes more sense in the context of
> > "flexible matching", which is more expansive than "flexible scoring"?
> 
> I think so.  Maybe it shouldn't be called a Similarity (which to me
> (though, carrying a heavy curse of knowledge burden...) means
> "scoring")?  Matcher?

Heh.  "Matcher" is taken.  It's a crucial class, too, roughly combining the
roles of Lucene's Scorer and DocIdSetIterator.

The first alternative that comes to mind is "Relevance", because not only can
one thing's relevance to another be continuously variable (i.e. score), it can
also be binary: relevant/not-relevant (i.e. match).

But I don't see why "Relevance", "Matcher", or anything else would be so much
better than "Similarity".  I think this is your hang-up.  ;)

> > I'm +0 (FWIW) on search-time Sim settability for Lucene.  It's a nice
> > feature, but I don't think we've worked out all the problems yet.  If we
> > can, I might switch to +1 (FWIW).
> 
> What problems remain, for Lucene?

Storage, formatting, and compression of boosts.

I'm also concerned about making significant changes to the file format when
you've indicated they're "for starters".  IMO, file format changes ought to
clear a higher bar than that.  But I expect to dissent on that point.

Marvin Humphrey

