Re: Baby steps towards making Lucene's scoring more flexible...

Marvin Humphrey Mon, 15 Mar 2010 17:49:48 -0700

On Mon, Mar 15, 2010 at 05:28:33AM -0500, Michael McCandless wrote:
> I mean specifically one should not have to commit to the precise
> scoring model they will use for a given field, when they index that
> field.


Yeah, I've never seen committing to a precise scoring model at index-time via
Sim choice as a big deal.  In Lucy, per-field Similarity assignments are part
of the Schema, which has to be set at index-time.  And index-time Sim
choice is the way things have always been done in Lucene.

In any case, the proposal to start delaying Sim choice to search-time -- while
a nice feature for Lucene -- is a non-starter for Lucy.  Generating boost
bytes at Searcher construction time and caching them within the object would
kill the cheap-Searcher model.  We need those boost bytes written to disk so
we can mmap them and share them amongst many cheap Searchers.
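
A minimal sketch of the mmap-and-share idea in Java, for illustration only.
BoostByteFile and CheapSearcher are hypothetical names, not Lucy or Lucene
classes, and the one-byte-per-document file layout is an assumption.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: one precomputed boost byte per document, stored on disk.
final class BoostByteFile {
    // Index-time: persist the boost bytes so no Searcher has to recompute them.
    static void write(Path path, byte[] boostBytes) throws IOException {
        Files.write(path, boostBytes);
    }

    // Search-time: map the file read-only.  The mapping stays valid after the
    // channel is closed.
    static MappedByteBuffer map(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }
}

// Hypothetical Searcher: construction only stores a reference to the mapping.
final class CheapSearcher {
    private final MappedByteBuffer boosts;

    CheapSearcher(MappedByteBuffer boosts) {
        this.boosts = boosts;
    }

    // Absolute get() does not move the buffer position, so many Searchers can
    // read from the same mapping.
    byte boostByte(int docId) {
        return boosts.get(docId);
    }
}
```

Constructing a CheapSearcher here costs nothing beyond a field assignment,
which is the property that search-time boost byte generation would give up.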

So... you're proposing shrinking Similarity's public API by removing
functionality that Lucy can't live without.  If indeed that works out for
Lucene, the role of Similarity within the two libraries will have to diverge.
In Lucene, Similarity will get smaller; in Lucy it will expand a bit.

To my mind, these are all related data reduction tasks:

  * Omit doc-boost and field-boost, replacing them with a single float
    docXfield multiplier -- because you never need doc-boost on its own.
  * Omit length-in-tokens, term-cardinality, doc-boost, and field-boost,
    replacing them all with a single boost byte -- because for the kind of
    scoring you want to do, you don't need all those raw stats.
  * Omit the boost byte, because you don't need to do scoring at all.
  * Omit positions because you don't need PhraseQueries, etc. to match.
  * Omit everything except doc-id, because you only need binary matching.
    
What all those tasks have in common is that we can determine what stats are
disposable based on how the user describes how they are going to use the
field.

For Lucy, the user is going to have to commit to a "precise scoring model" at
index-time by specifying a Sim choice anyway.  If that Sim turns out to be a
MatchSimilarity, why on earth should we keep around the boost bytes?

> > And what class other than Similarity knows enough about the scoring
> > algorithm to perform these data reduction tasks?  If it's not going to be
> > Similarity itself, it has to be something that knows absolutely everything
> > about the Similarity implementation's scoring model.
> 
> I don't follow this...
> 
> It will be Sim that computes norm bytes.

I meant that if you're writing out boost bytes, there's no sensible way to
execute the lossy data reduction and reduce the index size other than having
Sim do it.  
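
To make the lossy reduction concrete, here is a rough Java sketch.  The
formula (boosts divided by the square root of the token count) and the linear
8-bit quantization are assumptions chosen for illustration.  This is not
Lucene's or Lucy's actual norm encoding, and BoostByteSketch is a hypothetical
name.

```java
// Hypothetical: collapses several raw stats into one lossy byte.
final class BoostByteSketch {
    // Assumed upper bound on the combined multiplier; larger values saturate.
    static final float MAX_NORM = 16f;

    // Only code that knows the scoring model can decide which stats to
    // combine and how much precision is safe to throw away.
    static byte encode(float docBoost, float fieldBoost, int lengthInTokens) {
        float raw = docBoost * fieldBoost
                / (float) Math.sqrt(Math.max(1, lengthInTokens));
        float clamped = Math.min(Math.max(raw, 0f), MAX_NORM);
        return (byte) Math.round(clamped / MAX_NORM * 255f);
    }

    // Search-time: recover an approximate multiplier from the stored byte.
    static float decode(byte b) {
        return (b & 0xFF) / 255f * MAX_NORM;
    }
}
```

Under this scheme a 100-token field and a 101-token field with equal boosts
land on the same byte, and the original length, doc-boost, and field-boost
cannot be recovered from it.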

> >  class MySim extends Similarity {
> >    public PostingCodec makePostingCodec() {
> >      StandardPostingCodec codec = new StandardPostingCodec();
> >      codec.setOmitBoostBytes(true);
> >      codec.setOmitPositions(true);
> >      return (PostingCodec)codec;
> >    }
> >  }
> 
> This still feels like you are mixing two very different concepts --
> what's being written (boost bytes, positions, docTermFreqs) vs how it's
> encoded (codec).  

So StandardPostingCodec shouldn't have methods like setOmitBoostBytes()?
Maybe that's right.  Guess I'll watch to see how flex pans out and what
methods you put on those PostingCodec classes.

For now, I just want to make the no-boost-bytes and doc-id-only index
optimizations available, and to achieve that, it's sufficient to implement
format-follows-sim and publish MatchSimilarity and MinimalSimilarity.  The
PostingCodec API can remain a private implementation detail until a later
date.
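
One way format-follows-sim might look, sketched in Java.  All class names
below are hypothetical stand-ins.  In particular, the split between a
match-style Sim (no boost bytes, positions kept) and a minimal Sim (doc ids
only) is an assumption about what MatchSimilarity and MinimalSimilarity would
do, not something spelled out in this thread.

```java
// Hypothetical description of what gets written for a field.
final class PostingFormatSketch {
    final boolean boostBytes;
    final boolean positions;

    PostingFormatSketch(boolean boostBytes, boolean positions) {
        this.boostBytes = boostBytes;
        this.positions = positions;
    }
}

// Hypothetical base class: the Sim, not the user, picks the on-disk format.
abstract class SimSketch {
    // Default for full scoring: keep boost bytes and positions.
    PostingFormatSketch makePostingFormat() {
        return new PostingFormatSketch(true, true);
    }
}

final class FullScoringSimSketch extends SimSketch {
}

// Assumed behavior: no scoring, so boost bytes are dropped.
final class MatchSimSketch extends SimSketch {
    @Override
    PostingFormatSketch makePostingFormat() {
        return new PostingFormatSketch(false, true);
    }
}

// Assumed behavior: binary matching only, so just doc ids are kept.
final class MinimalSimSketch extends SimSketch {
    @Override
    PostingFormatSketch makePostingFormat() {
        return new PostingFormatSketch(false, false);
    }
}
```

The format descriptor never has to appear in the public API; the user only
names a Sim in the Schema.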

> Shouldn't Lucy's schema record what stats should be indexed for the field?  

No, it shouldn't -- not directly.  

You tell the Schema how you want the field to be used.  That information is
used to derive what stats are needed, and whether the ones that are needed can
be combined, compressed, etc.

> Then, any codec you swap in should respect that?  EG maybe I use PForCodec
> instead, or a PulsingCodec(PForCodec)?

I guess.  I don't see publishing a PForCodec with an elaborate API as being
very important, though.  It's more important to just use PFOR internally when
it's the best choice.

> I'm thinking the various Sim classes, which you'd select during
> searching, will note in jdocs what attrs must be indexed.  It's your
> job to read that and set your field (schema) up accordingly, ie,
> enable those required attrs. 

Yeah, that'll at least get the job done for Lucene.  

I don't think it's ideal to force people to understand that stuff, but hey,
the more people are confused, the more important it is for them to buy
optimization seminars where Lucene gurus explain all the obscure incantations
to them.  :)

> > You seem to be fixated on the notion of swapping in a MatchOnlySim object
> > at search time.  You can't do that in KS/Lucy, because you can't modify a
> > Schema at search-time, and the per-field Similarity assignments are part
> > of the Schema.  But *it doesn't matter* because you don't need a
> > MatchOnlySim to do doc-id-only postings iteration -- an
> > AllBellsAndWhistlesScoringSim can spawn a doc-id-only PostingDecoder just
> > as easily as MatchOnlySim can.
> 
> I am fixated because it's a glaring example (to me) of what's wrong
> with forcing the user to commit to how scoring is going to happen, at
> index time, for that field.

Haha, well that would sure suck if it didn't work!  

But I'm telling you it's no problem.

> And I'm still confused on how this'll work in Lucy -- if in my global
> write-once Lucy schema I bind a field during indexing to
> AllBellsAndWhistlesScoringSim... then at search time, sure, it can
> spawn a doc-id-only PostingDecoder... so that does mean I can do
> match-only searching using that, somehow?  

Of course.

Lucene can't do that?  No way, that can't be right!  I've gotta be missing
something.  (Though I guess that would explain the fixation on needing a
different Sim.)  

Needing a special Sim for match-only seems like an absurd limitation -- I mean
the doc id data is there, and you don't need scores.  You've gotta be able to
fake it at least.

> (Ie I can't change the field to MatchOnlySim, but, I have some workaround
> that lets me achieve the same functionality...?).

It's not a workaround.  Things just work that way.  

Without getting into the gory details... if you're not calculating a score,
you don't need Similarity's functionality.  If Lucene still needs a Sim object
despite not needing its functionality, that's just an accident of the OO
design, and it so happens that our "loose C" port doesn't have the same quirk.
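
A small Java sketch of the point, under loud assumptions: PostingsSketch,
ScoreFn, and DecoderSketch are hypothetical, in-memory stand-ins rather than
Lucy's PostingDecoder or Lucene's Scorer.  The same postings serve both a
scored and a match-only pass, and only the scored pass takes a scoring
function.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory postings for one term: parallel doc-id/freq arrays.
final class PostingsSketch {
    final int[] docIds;
    final int[] freqs;

    PostingsSketch(int[] docIds, int[] freqs) {
        this.docIds = docIds;
        this.freqs = freqs;
    }
}

// Stand-in for the scoring half of a Similarity.
interface ScoreFn {
    float score(int freq);
}

final class DecoderSketch {
    // Match-only pass: reads doc ids and never touches a ScoreFn.
    static List<Integer> matchOnly(PostingsSketch p) {
        List<Integer> hits = new ArrayList<>();
        for (int docId : p.docIds) {
            hits.add(docId);
        }
        return hits;
    }

    // Scored pass over the very same postings; only this path needs a Sim.
    static float[] scored(PostingsSketch p, ScoreFn sim) {
        float[] scores = new float[p.docIds.length];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = sim.score(p.freqs[i]);
        }
        return scores;
    }
}
```

Whichever Sim the field was indexed with, matchOnly() works unchanged, since
the doc ids are present in every format.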

Marvin Humphrey

