On Thu, Mar 25, 2010 at 1:20 PM, Marvin Humphrey <mar...@rectangular.com> wrote:
> On Thu, Mar 25, 2010 at 06:24:34AM -0400, Michael McCandless wrote:
>> >> Also, will Lucy store the original stats?
>> >
>> > These?
>> >
>> >   * Total number of tokens in the field.
>> >   * Number of unique terms in the field.
>> >   * Doc boost.
>> >   * Field boost.
>>
>> Also sum(tf).  Robert can generate more :)
>
> Hmm, aren't "Total number of tokens in the field" and sum(tf) normally
> equivalent?  I guess there might be analyzers for which that isn't true, e.g.
> those which perform synonym-injection?
>
> In any case, "sum(tf)" is probably a better definition, because it makes no
> ancillary claims...

Sorry, yes they are.

>> > Incidentally, what are you planning to do about field boost if it's not
>> > always 1.0?  Are you going to store full 32-bit floats?
>>
>> For starters, yes.
>
> OK, how are those going to be encoded?  IEEE 754?  Big-endian?
>
>    http://en.wikipedia.org/wiki/Endianness#Floating-point_and_endianness

For starters, I think so.  Lucene's ints are big-endian today.
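
For the concrete layout, here's a minimal sketch (mine, not Lucene's
actual writer code) of what that encoding amounts to: the 4 big-endian
bytes of the float's IEEE 754 bit pattern, i.e. Float.floatToIntBits()
pushed through a big-endian int write:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class FloatEncoding {
        /** Encode a float as 4 big-endian bytes of its IEEE 754 bits. */
        static byte[] encodeBigEndian(float value) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream(4);
            DataOutputStream out = new DataOutputStream(bytes);
            // DataOutputStream writes ints big-endian.
            out.writeInt(Float.floatToIntBits(value));
            return bytes.toByteArray();
        }

        public static void main(String[] args) throws IOException {
            // 1.0f is 0x3F800000, so this prints: 3F 80 00 00
            for (byte b : encodeBigEndian(1.0f)) {
                System.out.printf("%02X ", b & 0xFF);
            }
        }
    }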

>> We may (later) want to make a new attr that sets
>> the #bits (levels/precision) you want... then uses packed ints to
>> encode.
>
> I'm concerned that the bit-wise entropy of floats may make them a poor match
> for compression via packed ints.  We'll probably get a compressed
> representation which is larger than the original.
>
> Are there any standard algorithms out there for compressing IEEE 754 floats?
> RLE works, but only with certain data patterns.
>
> ... [ time passes ] ...
>
> Hmm, maybe not:
>
>    http://stackoverflow.com/questions/2238754/compression-algorithm-for-ieee-754-data

Sorry, I was proposing a fixed-point boost, where you specify how many
levels you want (in bits, i.e. powers of 2).
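
Roughly like this hypothetical sketch (the names and the assumed known
boost range are mine, not an actual API): quantize each boost into 2^bits
levels, so every value needs only `bits` bits and packs cleanly into
packed ints:

    /* Hypothetical fixed-point boost: the user picks the precision. */
    public class FixedPointBoost {
        private final int bits;       // precision chosen by the user
        private final float maxBoost; // assumed known upper bound on boosts

        public FixedPointBoost(int bits, float maxBoost) {
            this.bits = bits;
            this.maxBoost = maxBoost;
        }

        /** Map a float boost onto an integer in [0, 2^bits - 1]. */
        public int quantize(float boost) {
            int levels = (1 << bits) - 1;
            float clamped = Math.min(Math.max(boost, 0f), maxBoost);
            return Math.round(clamped / maxBoost * levels);
        }

        /** Recover an approximate boost from its quantized form. */
        public float dequantize(int quantized) {
            int levels = (1 << bits) - 1;
            return (float) quantized / levels * maxBoost;
        }
    }

Since what gets packed is a small integer rather than a raw IEEE 754 bit
pattern, the entropy concern above wouldn't apply.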

>> I was specifically asking if Lucy will allow the user to force the true
>> average to be recomputed, i.e., at commit time from the writer.
>
> That's theoretically possible.  We'd have to implement the reader the same way
> we have DeletionsReader -- the most recent segment may contain data which
> applies to older segments.

OK.

> Here's the DeletionsReader code, which searches backwards through the segments
> looking for a particular file:
>
>    /* Start with deletions files in the most recently added segments and work
>     * backwards.  The first one we find which addresses our segment is the
>     * one we need. */
>    for (i = VA_Get_Size(segments) - 1; i >= 0; i--) {
>        Segment *other_seg = (Segment*)VA_Fetch(segments, i);
>        /* Does this segment record any deletions metadata at all? */
>        Hash *metadata
>            = (Hash*)Seg_Fetch_Metadata_Str(other_seg, "deletions", 9);
>        if (metadata) {
>            Hash *files = (Hash*)CERTIFY(
>                Hash_Fetch_Str(metadata, "files", 5), HASH);
>            /* ...and does it cover the segment we're opening? */
>            Hash *seg_files_data
>                = (Hash*)Hash_Fetch(files, (Obj*)my_seg_name);
>            if (seg_files_data) {
>                Obj *count = (Obj*)CERTIFY(
>                    Hash_Fetch_Str(seg_files_data, "count", 5), OBJ);
>                del_count = (i32_t)Obj_To_I64(count);
>                del_file  = (CharBuf*)CERTIFY(
>                    Hash_Fetch_Str(seg_files_data, "filename", 8), CHARBUF);
>                break;
>            }
>        }
>    }

Hmm -- similar to tombstones?  But different in that the most recently
written file has *all* deletions for that old segment?  I.e., you don't
have to OR together N generations of written deletions... only one file
has all current deletions for the segment?  This is somewhat wasteful of
disk space, though?  Hmm, unless your deletion policy can reclaim the
now-stale deletions files from past flushed segments?

> What we'd do is write the regenerated boost bytes for *all* segments to the
> most recent segment.  It would be roughly analogous to building up an NRT
> reader.

Right, except Lucy must go through the filesystem.

>> > What's trickier is that Schemas are not normally mutable, and that they
>> > are part of the index.  You don't have to supply an Analyzer, or a
>> > Similarity, or anything else when opening a Searcher -- you just provide
>> > the location of the index, and the Schema gets deserialized from the
>> > latest schema_NNN.json file.  That has many advantages, e.g. inadvertent
>> > Analyzer conflicts are pretty much a thing of the past for us.
>>
>> That's nice... though... is it too rigid?  Do users even want to pick
>> a different analyzer at search time?
>
> It's not common.
>
> To my mind, the way a field is tokenized is part of its field definition, thus
> the Analyzer is part of the field definition, thus the analyzer is part of the
> schema and needs to be stored with the index.

OK.

> Still, we support different Analyzers at search time by way of QueryParser.
> QueryParser's constructor requires a Schema, but also accepts an optional
> Analyzer which if supplied will be used instead of the Analyzers from the
> Schema.

Ahh OK there's an out.
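
On the Lucene side the analogous out looks something like this (3.0-era
API, from memory, so treat the exact signatures as approximate): the
analyzer is just a constructor argument, so search time can use a
different one than index time:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    public class SearchTimeAnalyzer {
        public static void main(String[] args) throws Exception {
            Analyzer indexTime  = new StandardAnalyzer(Version.LUCENE_30);
            Analyzer searchTime = new WhitespaceAnalyzer();

            // Same query string, two analyzers -- the parser just uses
            // whichever Analyzer the caller hands it.
            Query q1 = new QueryParser(Version.LUCENE_30, "body", indexTime)
                    .parse("Flexible Scoring");
            Query q2 = new QueryParser(Version.LUCENE_30, "body", searchTime)
                    .parse("Flexible Scoring");
            System.out.println(q1 + "\n" + q2);
        }
    }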

>> > Maybe aggressive automatic data-reduction makes more sense in the context 
>> > of
>> > "flexible matching", which is more expansive than "flexible scoring"?
>>
>> I think so.  Maybe it shouldn't be called a Similarity (which to me
>> (though, carrying a heavy curse of knowledge burden...) means
>> "scoring")?  Matcher?
>
> Heh.  "Matcher" is taken.  It's a crucial class, too, roughly combining the
> roles of Lucene's Scorer and DocIdSetIterator.
>
> The first alternative that comes to mind is "Relevance", because not only can
> one thing's relevance to another be continuously variable (i.e. score), it can
> also be binary: relevant/not-relevant (i.e. match).
>
> But I don't see why "Relevance", "Matcher", or anything else would be so much
> better than "Similarity".  I think this is your hang-up.  ;)

Yeah OK.

>> > I'm +0 (FWIW) on search-time Sim settability for Lucene.  It's a nice
>> > feature, but I don't think we've worked out all the problems yet.  If we
>> > can, I might switch to +1 (FWIW).
>>
>> What problems remain, for Lucene?
>
> Storage, formatting, and compression of boosts.
>
> I'm also concerned about making significant changes to the file format when
> you've indicated they're "for starters".  IMO, file format changes ought to
> clear a higher bar than that.  But I expect us to dissent on that point.

I think we do dissent on this... progress not perfection ;)

I see the file format as an impl detail, not as a public API.  It's free
to change, and because it's easy to version, changing it isn't that bad.
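
Concretely, "easy to version" can be as simple as this sketch (mine, not
Lucene's actual header-checking code): stamp every file with a magic
number and a format version, verify both at open, and branch on the
version when reading older layouts:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    public class VersionedFile {
        static final int MAGIC = 0x0BADCAFE;  // arbitrary magic number
        static final int CURRENT_VERSION = 2; // bump on any layout change

        static void writeHeader(DataOutput out) throws IOException {
            out.writeInt(MAGIC);
            out.writeInt(CURRENT_VERSION);
        }

        /** Returns the version; callers branch on it for old layouts. */
        static int checkHeader(DataInput in) throws IOException {
            if (in.readInt() != MAGIC) {
                throw new IOException("not a recognized index file");
            }
            int version = in.readInt();
            if (version > CURRENT_VERSION) {
                throw new IOException("written by a newer release: " + version);
            }
            return version;
        }
    }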

Mike
