Re: Baby steps towards making Lucene's scoring more flexible...

Michael McCandless Thu, 18 Mar 2010 03:16:55 -0700

On Mon, Mar 15, 2010 at 7:49 PM, Marvin Humphrey <[email protected]> wrote:
> On Mon, Mar 15, 2010 at 05:28:33AM -0500, Michael McCandless wrote:
>> I mean specifically one should not have to commit to the precise
>> scoring model they will use for a given field, when they index that
>> field.
>
> Yeah, I've never seen committing to a precise scoring model at index-time via
> Sim choice as a big deal.  In Lucy, per-field Similarity assignments are part
> of the Schema, which has to be set at index-time.  And index-time Sim
> choice is the way things have always been done in Lucene.


OK.  It's new territory -- I haven't heard of users doing lots of
scoring experimentation with Lucene.  But, then, it's not easy to do
now, so... chicken & egg.

Also, will Lucy store the original stats?  Ie so the chosen Sim
can properly recompute all boost bytes (if it uses those), for scoring
models that "pivot" based on avg's of these stats?

> In any case, the proposal to start delaying Sim choice to search-time -- while
> a nice feature for Lucene -- is a non-starter for Lucy.   We can't do that
> because it would kill the cheap-Searcher model to generate boost bytes at
> Searcher construction time and cache them within the object.  We need those
> boost bytes written to disk so we can mmap them and share them amongst many
> cheap Searchers.

It'd seem like Lucy could re-gen the boost bytes if a different Sim
were selected, or, the current Sim hadn't yet computed & cached its
bytes?  But then logically this means a "reader" needs write
permission to the index dir, which is not good...

> So... you're proposing shrinking Similarity's public API by removing
> functionality that Lucy can't live without.  If indeed that works out for
> Lucene, the role of Similarity within the two libraries will have to diverge.
> In Lucene, Similarity will get smaller; in Lucy it will expand a bit.

Yes.

> To my mind, these are all related data reduction tasks:
>
>  * Omit doc-boost and field-boost, replacing them with a single float
>    docXfield multiplier -- because you never need doc-boost on its own.
>  * Omit length-in-tokens, term-cardinality, doc-boost, and field-boost,
>    replacing them all with a single boost byte -- because for the kind of
>    scoring you want to do, you don't need all those raw stats.
>  * Omit the boost byte, because you don't need to do scoring at all.
>  * Omit positions because you don't need PhraseQueries, etc. to match.

I wouldn't group this one with the others -- I mean technically it is
"data reduction" -- but omitting positions means certain queries
(PhraseQuery) won't work even in "match only" searching.  Whereas the
rest of these examples affect how scoring is done (or whether it's
done).

>  * Omit everything except doc-id, because you only need binary matching.
>
> What all those tasks have in common is that we can determine what stats are
> disposable based on how the user describes how they are going to use the
> field.
>
> For Lucy, the user is going to have to commit to a "precise scoring model" at
> index-time by specifying a Sim choice anyway.

Right.

> If that Sim turns out to be a MatchSimilarity, why on earth should
> we keep around the boost bytes?

Well maybe some queries do scoring on the field and some don't...

>> > And what class other than Similarity knows enough about the scoring
>> > algorithm to perform these data reduction tasks?  If it's not going to be
>> > Similarity itself, it has to be something that knows absolutely everything
>> > about the Similarity implementation's scoring model.
>>
>> I don't follow this...
>>
>> It will be Sim that computes norm bytes.
>
> I meant that if you're writing out boost bytes, there's no sensible way to
> execute the lossy data reduction and reduce the index size other than having
> Sim do it.

Right, Sim is the right class to do this.  Heck, one could even use
boost nibbles... or, use float.  This is an impl detail of the Sim
class.
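
As a rough illustration of the "boost nibbles" idea, here is a small Java
sketch that packs two 4-bit norms into each byte.  It assumes norms have
already been scaled into [0, 1]; the class NibbleNorms and its methods are
hypothetical, not anything in Lucene or Lucy.

```java
// Hypothetical sketch -- an illustration of 4-bit norms, not a real Sim.
final class NibbleNorms {
    /** Quantize a norm in [0, 1] to 4 bits (0..15); out-of-range input is clamped. */
    static int quantize(double norm) {
        double clamped = Math.max(0.0, Math.min(1.0, norm));
        return (int) Math.round(clamped * 15.0);
    }

    /** Map a 4-bit value back to an approximate norm in [0, 1]. */
    static double dequantize(int nibble) {
        return nibble / 15.0;
    }

    /** Pack one nibble per doc: even doc ids in the low bits, odd in the high bits. */
    static byte[] pack(double[] norms) {
        byte[] packed = new byte[(norms.length + 1) / 2];
        for (int doc = 0; doc < norms.length; doc++) {
            int shift = (doc & 1) == 0 ? 0 : 4;
            packed[doc >> 1] |= (byte) (quantize(norms[doc]) << shift);
        }
        return packed;
    }

    /** Read back the approximate norm for a doc. */
    static double get(byte[] packed, int doc) {
        int shift = (doc & 1) == 0 ? 0 : 4;
        return dequantize((packed[doc >> 1] >> shift) & 0x0F);
    }
}
```

This halves the space of a byte-per-doc array at the cost of only 16
distinct values.  Callers only ever see pack() and get(), so switching to
bytes or floats would not change anything outside the class.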

>> >  class MySim extends Similarity {
>> >    public PostingCodec makePostingCodec() {
>> >      StandardPostingCodec codec = new StandardPostingCodec();
>> >      codec.setOmitBoostBytes(true);
>> >      codec.setOmitPositions(true);
>> >      return (PostingCodec)codec;
>> >    }
>> >  }
>>
>> This still feels like you are mixing two very different concepts --
>> what's being written (boost bytes, positions, docTermFreqs) vs how it's
>> encoded (codec).
>
> So StandardPostingCodec shouldn't have methods like setOmitBoostBytes()?
> Maybe that's right.  Guess I'll watch to see how flex pans out and what
> methods you put on those PostingCodec classes.

Yeah, I see that (setOmitBoostBytes) as part of the field's type.  It's
like precisionStep for a numeric field, or omitTF/P.  Any codec should
respect these.
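
Roughly what I have in mind, as a Java sketch under made-up names
(FieldIndexOptions, PostingEncoder, PlainEncoder and GapEncoder are all
hypothetical, and this is not the flex API): the field's options decide what
is written, and each codec only decides how it is encoded.  Carrying the
boost byte alongside each doc is a simplification for the example.

```java
// Hypothetical sketch -- not the flex-branch codec API.
final class FieldIndexOptions {
    final boolean omitBoostBytes;
    final boolean omitPositions;

    FieldIndexOptions(boolean omitBoostBytes, boolean omitPositions) {
        this.omitBoostBytes = omitBoostBytes;
        this.omitPositions = omitPositions;
    }
}

/** A codec decides only HOW the requested data is encoded. */
abstract class PostingEncoder {
    /** Codec-specific encoding of one doc's positions. */
    abstract String encodePositions(int[] positions);

    /** Shared logic: WHAT gets written is dictated by the field's options. */
    final String encode(FieldIndexOptions opts, int docId, int boostByte, int[] positions) {
        StringBuilder out = new StringBuilder("doc=").append(docId);
        if (!opts.omitBoostBytes) {
            out.append(" boost=").append(boostByte);
        }
        if (!opts.omitPositions) {
            out.append(" pos=").append(encodePositions(positions));
        }
        return out.toString();
    }
}

/** Writes absolute positions. */
final class PlainEncoder extends PostingEncoder {
    @Override
    String encodePositions(int[] positions) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < positions.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(positions[i]);
        }
        return sb.toString();
    }
}

/** Writes the gap from the previous position instead of the absolute value. */
final class GapEncoder extends PostingEncoder {
    @Override
    String encodePositions(int[] positions) {
        StringBuilder sb = new StringBuilder();
        int previous = 0;
        for (int i = 0; i < positions.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(positions[i] - previous);
            previous = positions[i];
        }
        return sb.toString();
    }
}
```

Swapping PlainEncoder for GapEncoder changes the bytes on disk, but neither
can write positions or boost bytes that the field's options say to omit.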

> For now, I just want to make the no-boost-bytes and doc-id-only index
> optimizations available, and to achieve that, it's sufficient to implement
> format-follows-sim and publish MatchSimilarity and MinimalSimilarity.  The
> PostingCodec API can remain a private implementation detail until a later
> date.

OK.

>> Shouldn't Lucy's schema record what stats should be indexed for the field?
>
> No, it shouldn't -- not directly.
>
> You tell the Schema how you want the field to be used.  That information is
> used to derive what stats are needed, and whether the ones that are needed can
> be combined, compressed, etc.

OK, we just disagree here.

>> Then, any codec you swap in should respect that?  EG maybe I use PForCodec
>> instead, or a PulsingCodec(PForCodec)?
>
> I guess.  I don't see publishing a PForCodec with an elaborate API as being
> very important, though.  It's more important to just use PFOR internally when
> it's the best choice.

But there are tradeoffs of each that we can't just "pick" ourselves.

PFor is slower indexing but faster searching, especially match-only
searching, I think.

Pulsing is perhaps only helpful when you intend to look up many
terms on the field at once (eg a big "in list" on a primary key
field).  Or if you expect to do MTQs spanning many terms...eg
FuzzyQuery.

>> I'm thinking the various Sim classes, which you'd select during
>> searching, will note in jdocs what attrs must be indexed.  It's your
>> job to read that and set your field (schema) up accordingly, ie,
>> enable those required attrs.
>
> Yeah, that'll at least get the job done for Lucene.
>
> I don't think it's ideal to force people to understand that stuff, but hey,
> the more people are confused, the more important it is for them to buy
> optimization seminars where Lucene gurus explain all the obscure incantations
> to them.  :)

Zing!

I think this all boils down to how important flexible scoring is --
I'd like users to be able to try out different scoring at search
time, even if it means "having to understand low level stuff" when
setting their field types during indexing.

You don't think flexible scoring is that important ("just reindex")
and that it's not great to have users understand low level stats for
indexing.

I can see both sides.  I'm just on the other side of the see-saw ;)
I'm picking a different lesser evil...

>> > You seem to be fixated on the notion of swapping in a MatchOnlySim object
>> > at search time.  You can't do that in KS/Lucy, because you can't modify a
>> > Schema at search-time, and the per-field Similarity assignments are part of
>> > the Schema.  But *it doesn't matter* because you don't need a MatchOnlySim
>> > to do doc-id-only postings iteration -- an AllBellsAndWhistlesScoringSim
>> > can spawn a doc-id-only PostingDecoder just as easily as MatchOnlySim can.
>>
>> I am fixated because it's a glaring example (to me) of what's wrong
>> with forcing the user to commit to how scoring is going to happen, at
>> index time, for that field.
>
> Haha, well that would sure suck if it didn't work!
>
> But I'm telling you it's no problem.

OK at this point I'll just take your word for it :)  I don't fully
understand how it'll work but I don't really need to.

>> And I'm still confused on how this'll work in Lucy -- if in my global
>> write-once Lucy schema I bind a field during indexing to
>> AllBellsAndWhistlesScoringSim... then at search time, sure, it can
>> spawn a doc-id-only PostingDecoder... so that does mean I can do
>> match-only searching using that, somehow?
>
> Of course.
>
> Lucene can't do that?  No way, that can't be right!  I've gotta be missing
> something.  (Though I guess that would explain the fixation on needing a
> different Sim.)
>
> Needing a special Sim for match-only seems like an absurd limitation -- I mean
> the doc id data is there, and you don't need scores.  You've gotta be able to
> fake it at least.
>
>> (Ie I can't change the field to MatchOnlySim, but, I have some workaround
>> that lets me achieve the same functionality...?).
>
> It's not a workaround.  Things just work that way.
>
> Without getting into the gory details... if you're not calculating a score,
> you don't need Similarity's functionality.  If Lucene still needs a Sim object
> despite not needing its functionality, that's just an accident of the OO
> design, and it so happens that our "loose C" port doesn't have the same quirk.

OK maybe I do understand now... and, yes, Lucene can do this (and it's
sounding like Lucy does it the same way), either by 1) simply never
calling .score() while collecting, or 2) using a Query impl that
intentionally strips the scores so that if you do call .score() from
your collector, it subverts you and instead returns a constant.
In Lucene you do the 2nd one with
ConstantScoreQuery(QueryWrapperFilter(YourOriginalQuery)).
Kinda a mouthful though...
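
The two options can be sketched in plain Java without pulling in Lucene.
DocScorer, ArrayScorer, ConstantScorer and MatchOnly are hypothetical
stand-ins, not Lucene's actual Scorer/Collector classes; the point is only
that the wrapper never touches the underlying score.

```java
// Hypothetical sketch -- simplified stand-ins, not Lucene's Scorer API.
interface DocScorer {
    /** Advance to the next matching doc id, or return -1 when exhausted. */
    int nextDoc();

    /** Score of the current doc (potentially expensive to compute). */
    float score();
}

/** A scorer over fixed arrays that counts how often score() is called. */
final class ArrayScorer implements DocScorer {
    private final int[] docs;
    private final float[] scores;
    private int index = -1;
    int scoreCalls = 0;

    ArrayScorer(int[] docs, float[] scores) {
        this.docs = docs;
        this.scores = scores;
    }

    @Override
    public int nextDoc() {
        index++;
        return index < docs.length ? docs[index] : -1;
    }

    @Override
    public float score() {
        scoreCalls++;
        return scores[index];
    }
}

/** Option 2: a wrapper that hands back a constant instead of the real score. */
final class ConstantScorer implements DocScorer {
    private final DocScorer inner;
    private final float constant;

    ConstantScorer(DocScorer inner, float constant) {
        this.inner = inner;
        this.constant = constant;
    }

    @Override
    public int nextDoc() {
        return inner.nextDoc();
    }

    @Override
    public float score() {
        return constant; // never delegates to inner.score()
    }
}

final class MatchOnly {
    /** Option 1: a collector that simply never calls score(). */
    static int countMatches(DocScorer scorer) {
        int count = 0;
        while (scorer.nextDoc() != -1) {
            count++;
        }
        return count;
    }

    /** A collector that does call score() for every hit. */
    static float sumScores(DocScorer scorer) {
        float sum = 0f;
        while (scorer.nextDoc() != -1) {
            sum += scorer.score();
        }
        return sum;
    }
}
```

Either way the inner scorer's score() is never invoked, so no Sim
functionality is exercised even though a Sim may still be attached to the
field.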

Mike
