Re: Baby steps towards making Lucene's scoring more flexible...

Marvin Humphrey Mon, 22 Mar 2010 09:45:40 -0700

On Thu, Mar 18, 2010 at 05:16:23AM -0500, Michael McCandless wrote:
> Also, will Lucy store the original stats?


These?

   * Total number of tokens in the field.
   * Number of unique terms in the field.
   * Doc boost.
   * Field boost.

That would depend on which Similarity the user specs for that field.  In
other words, it's just another data-reduction decision: if the Sim needs it,
keep it, and if it doesn't, throw it away.
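
To make the "if the Sim needs it, keep it" rule concrete, here is a minimal
sketch.  Every name in it (StatSelection, RawStat, requiredStats, the two
example Sims) is hypothetical and invented for illustration; none of it is
existing Lucy or Lucene API.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical names throughout; not Lucy's or Lucene's real API.
final class StatSelection {
    /** The four raw per-field stats listed above. */
    enum RawStat { TOKEN_COUNT, UNIQUE_TERM_COUNT, DOC_BOOST, FIELD_BOOST }

    /** A Similarity declares which raw stats it needs at index time. */
    interface Sim {
        Set<RawStat> requiredStats();
    }

    /** Match-only: no scoring, so no raw stats are needed. */
    static final Sim MATCH_ONLY = () -> EnumSet.noneOf(RawStat.class);

    /** A length-pivoting model that wants raw lengths so averages can be recomputed. */
    static final Sim PIVOTED =
        () -> EnumSet.of(RawStat.TOKEN_COUNT, RawStat.UNIQUE_TERM_COUNT);

    /** The indexer keeps exactly what the field's Sim asks for and drops the rest. */
    static Set<RawStat> statsToStore(Sim sim) {
        Set<RawStat> keep = EnumSet.noneOf(RawStat.class);
        keep.addAll(sim.requiredStats());
        return keep;
    }

    public static void main(String[] args) {
        System.out.println("match-only keeps: " + statsToStore(MATCH_ONLY));
        System.out.println("pivoted keeps:    " + statsToStore(PIVOTED));
    }
}
```

The point of the sketch is only that the decision lives with the Sim chosen
for the field, so the indexer never has to guess what to retain.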

Incidentally, what are you planning to do about field boost if it's not always
1.0?  Are you going to store full 32-bit floats?

> Ie so the chosen Sim can properly recompute all boost bytes (if it uses
> those), for scoring models that "pivot" based on avg's of these stats?

Yes, we could support that.  

It's not high on my todo-list for core Lucy, though: poor payoff for all the
complexity it would introduce, particularly file format complexity with its
heavy backwards compatibility burden.  Right now, we only have the boost
bytes, and the fact that they are used for length normalization, field boost,
and doc boost is incidental.  If we add all the raw stats, that's a bunch of
stuff we have to support for a long time, yet which doesn't yield practical
advantages for us yet.
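
For readers who haven't looked at how one boost byte ends up carrying all
three factors, here is a rough sketch.  It assumes the classic
1/sqrt(numTokens) length normalization and a lossy one-byte float encoding
similar in spirit to the one Lucene uses for norms; the class name, method
names, and exact bit layout are illustrative assumptions, not a statement of
Lucy's file format.

```java
// Illustrative sketch; names and bit layout are assumptions.
final class BoostByte {
    private static final int SHIFT = 21;           // discard the low 21 mantissa bits
    private static final int ZERO_POINT = 48 << 3; // offset of the smallest encodable exponent

    /** Fold doc boost, field boost, and length normalization into one float. */
    static float multiplier(float docBoost, float fieldBoost, int numTokens) {
        return docBoost * fieldBoost * (float) (1.0 / Math.sqrt(numTokens));
    }

    /** Lossy reduction of a float to a single byte (truncates, never rounds up). */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> SHIFT;
        if (small <= ZERO_POINT) {
            // Zero and negatives become 0; tiny positives become the smallest code.
            return (byte) (bits <= 0 ? 0 : 1);
        }
        if (small >= ZERO_POINT + 0x100) {
            return (byte) 0xFF; // too large: clamp to the biggest code
        }
        return (byte) (small - ZERO_POINT);
    }

    /** Expand a boost byte back into an approximate float. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        return Float.intBitsToFloat(((b & 0xFF) << SHIFT) + (ZERO_POINT << SHIFT));
    }

    public static void main(String[] args) {
        for (int numTokens = 1; numTokens <= 8; numTokens++) {
            float m = multiplier(1.0f, 1.0f, numTokens);
            System.out.println(numTokens + " tokens: " + m + " -> " + decode(encode(m)));
        }
    }
}
```

Once the three inputs have been multiplied together and truncated, there is
no way to recover any one of them from the byte, which is why recomputing
boost bytes for a different Sim would require keeping the raw stats around.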

I'd be much more interested in finding a way to support such a feature as an
extension.

> > In any case, the proposal to start delaying Sim choice to search-time --
> > while a nice feature for Lucene -- is a non-starter for Lucy.  We can't do that
> > because it would kill the cheap-Searcher model to generate boost bytes at
> > Searcher construction time and cache them within the object.  We need those
> > boost bytes written to disk so we can mmap them and share them amongst many
> > cheap Searchers.
> 
> It'd seem like Lucy could re-gen the boost bytes if a different Sim
> were selected, or, the current Sim hadn't yet computed & cached its
> bytes?  But then logically this means a "reader" needs write
> permission to the index dir, which is not good...

Whatever's reading the boost bytes can't tell the difference between process
RAM and mmap'd RAM, so write-permission on the index dir isn't required.
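
A small Java illustration of that point, using only standard java.nio calls.
The class and method names (BoostByteSource, fromHeap, fromFile, boostFor) are
made up for this sketch, and the byte values are arbitrary; the consumer sees
a ByteBuffer either way and cannot tell which backing it got.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical names; a sketch of "process RAM vs. mmap'd RAM", nothing more.
final class BoostByteSource {
    /** Boost bytes generated in ordinary process memory. */
    static ByteBuffer fromHeap(byte[] boosts) {
        return ByteBuffer.wrap(boosts).asReadOnlyBuffer();
    }

    /** Boost bytes memory-mapped read-only from a file; no write access needed. */
    static ByteBuffer fromFile(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    /** The consumer: looks up one doc's boost byte, unaware of the backing store. */
    static byte boostFor(ByteBuffer boosts, int docId) {
        return boosts.get(docId);
    }

    /** Writes the bytes to a temp file, maps it, and compares against the heap copy. */
    static boolean heapAndMappedAgree(byte[] boosts) {
        try {
            Path tmp = Files.createTempFile("boosts", ".dat");
            tmp.toFile().deleteOnExit();
            Files.write(tmp, boosts);
            ByteBuffer heap = fromHeap(boosts);
            ByteBuffer mapped = fromFile(tmp);
            for (int docId = 0; docId < boosts.length; docId++) {
                if (boostFor(heap, docId) != boostFor(mapped, docId)) {
                    return false;
                }
            }
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(heapAndMappedAgree(new byte[] {124, 120, 116}));
    }
}
```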

What's trickier is that Schemas are not normally mutable, and that they are
part of the index.  You don't have to supply an Analyzer, or a Similarity, or
anything else when opening a Searcher -- you just provide the location of the
index, and the Schema gets deserialized from the latest schema_NNN.json file.
That has many advantages, e.g. inadvertent Analyzer conflicts are pretty much
a thing of the past for us.  But it makes your feature request of runtime
settability for Similarity awkward to implement: by the time you have a Schema
object to work with, the Searcher is already open.

  Searcher searcher = new Searcher("/path/to/index");
  Schema schema = searcher.getSchema();
  schema.setSim("content", altSim); // Too late, and not implemented anyway.

> > To my mind, these are all related data reduction tasks:
> >
> >  * Omit doc-boost and field-boost, replacing them with a single float
> >    docXfield multiplier -- because you never need doc-boost on its own.
> >  * Omit length-in-tokens, term-cardinality, doc-boost, and field-boost,
> >    replacing them all with a single boost byte -- because for the kind of
> >    scoring you want to do, you don't need all those raw stats.
> >  * Omit the boost byte, because you don't need to do scoring at all.
> >  * Omit positions because you don't need PhraseQueries, etc. to match.
> 
> I wouldn't group this one with the others -- I mean technically it is
> "data reduction" -- but omitting positions means certain queries
> (PhraseQuery) won't work even in "match only" searching.  Whereas the
> rest of these examples affect how scoring is done (or whether it's
> done).

Couldn't disagree more.  Omitting positions is *exactly* the kind of data
reduction task which we know is safe to perform when a user specifically tells
us they don't need PhraseQueries by specifying a MinimalSimilarity.

MinimalSimilarity will be documented as a good choice for single-token field
types like StringType, Int32Type, Float32Type, and so on -- because those
can't match multi-token PhraseQueries anyway.  Usage with FullTextType will be
discouraged.

Maybe aggressive automatic data-reduction makes more sense in the context of
"flexible matching", which is more expansive than "flexible scoring"?

> > If that Sim turns out to be a MatchSimilarity, why on earth should
> > we keep around the boost bytes?
> 
> Well maybe some queries do scoring on the field and some don't...

That would violate the contract the user made when they spec'd
MatchSimilarity.  Saying that Lucy should keep the boost bytes under those
circumstances is like saying that Lucene should outright ignore omitNorms()
and always write boost bytes because users can't be trusted.

> > I meant that if you're writing out boost bytes, there's no sensible way to
> > execute the lossy data reduction and reduce the index size other than having
> > Sim do it.
> 
> Right, Sim is the right class to do this.  Heck one could even use
> boost nibbles... or, use float.  This is an impl detail of the Sim
> class.

For Lucene, I think that makes sense, because the reduced form would be
ephemeral.  

For Lucy, it's more complicated because the reduced data gets written to the
index.  Core Sim implementations should all use the same algorithm in order to
minimize the complexity of the index file spec.  However, it would be nice to
offer an extension point enabling user-defined Sims to write non-standard
formats.

> I think this all boils down to how important flexible scoring is --

Oh, please, Mike.  Search-time settability for Similarity isn't the same thing
as "flexible scoring".  :(  Everybody thinks "flexible scoring" is important.

Frankly, I think we're going to do a better job making "flexible scoring"
available to our users because we're not going to make them fight through a
thicket of jargon to get it.

> I'd like users to be able to try out different scoring at search
> time, even if it means "having to understand low level stuff" when
> setting their field types during indexing.
> 
> You don't think flexible scoring is that important ("just reindex")
> and that it's not great to have users understand low level stats for
> indexing.

I'm +0 (FWIW) on search-time Sim settability for Lucene.  It's a nice feature,
but I don't think we've worked out all the problems yet.  If we can, I might
switch to +1 (FWIW).  

For Lucy, I'm -1 on search-time Sim settability, for a wide variety of
reasons.

Whether or not to perform automatic data-reduction based on Similarity choice
or force the user to specify data-reduction manually is a separate issue.

Marvin Humphrey

