Re: Baby steps towards making Lucene's scoring more flexible...

Marvin Humphrey Fri, 12 Mar 2010 17:31:51 -0800

On Thu, Mar 11, 2010 at 05:59:03AM -0500, Michael McCandless wrote:
> > So there would be polymorphism in the decoding phase while we're supplying
> > information the Similarity object needs to make its similarity judgments.
> > However, that polymorphism would be handled internally -- it wouldn't be the
> > responsibility of the user to determine whether a codec supported a
> > particular scoring model.
> 
> Is that "yes" (a user can do MatchOnlySim at search time if the field
> were indexed with BM25Sim)?


In essence, yes.  Technically, no.  

Under the covers, doc-id-only postings iteration probably wouldn't be
implemented by spawning a doc-id-only Similarity object.  It would probably be
something more like, ask the Similarity for a PostingDecoder with no extra
attributes.  And then docID-freq-boost postings iteration might be achieved by
asking the Similarity for a PostingDecoder with TermFreq and DocBoost
attributes. 

> How will Lucy "know" which switchups (Sim at indexing vs Sim at
> searching) are "OK"...

I think the theme is that each Similarity class will have a whitelist of
supported posting iteration configurations.  So long as the requested config
is in the whitelist, you get an iterator back -- otherwise, you get NULL.

Exactly what form the request specification would take, that's up in the air.
But it would be an implementation detail for now.  So long as the file format
supports the data, we can build an iterator that reads it, regardless of
encoding.
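
A minimal sketch of the whitelist idea, under stated assumptions: the names
Attr, PostingDecoder, make_posting_decoder and ScoringSimilarity are
hypothetical and not part of any existing Lucy or Lucene API, and the request
is modeled as a plain set of attributes only because the real form of the
request is still undecided.

```python
from enum import Enum, auto
from typing import FrozenSet, Optional


class Attr(Enum):
    """Hypothetical per-posting attributes a caller may request."""
    TERM_FREQ = auto()
    DOC_BOOST = auto()
    POSITIONS = auto()


class PostingDecoder:
    """Placeholder decoder; it only records which attributes it would decode."""
    def __init__(self, attrs: FrozenSet[Attr]):
        self.attrs = attrs


class Similarity:
    # Whitelist of attribute sets this Similarity can decode.
    # The empty set (doc-id-only iteration) is always on the list.
    SUPPORTED = frozenset({frozenset()})

    def make_posting_decoder(self, *attrs: Attr) -> Optional[PostingDecoder]:
        requested = frozenset(attrs)
        if requested in self.SUPPORTED:
            return PostingDecoder(requested)
        return None  # the "you get NULL" case


class ScoringSimilarity(Similarity):
    # Assumed whitelist for a Similarity whose postings carry freq, boost
    # and positions.
    SUPPORTED = frozenset({
        frozenset(),
        frozenset({Attr.TERM_FREQ, Attr.DOC_BOOST}),
        frozenset({Attr.TERM_FREQ, Attr.DOC_BOOST, Attr.POSITIONS}),
    })
```

Calling ScoringSimilarity().make_posting_decoder() with no arguments would
yield the doc-id-only decoder, while a request that is not on the list, such
as Attr.POSITIONS alone, would yield None.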

> >> Yeah.... so, I don't like that in Lucene you call "Field.setOmitTFAP"
> >> instead of saying "Field.matchOnly" (or something).  So I do agree
> >> that it'd be better if the API made it clear what the *search* time
> >> impact is of using this advanced Field API.
> >
> > In my opinion, it makes sense to communicate "match only" by way of the
> > Similarity object as opposed to a boolean.  I think it's a good way to
> > introduce the Similarity class and get people comfortable with it, and I
> > also think that it's good to keep stuff out of the FieldType API when we can.
> 
> But say we want to also allow storing tf but not positions, because
> really the two choices should not be coupled (as they are today with
> Lucene's omitTFAP).
> 
> So I have omitTF and omitP (only 3 combos are allowed -- must omitP if
> you omitTF).
> 
> What Sim do you call that at indexing time?

Well, those are pretty esoteric posting formats.  It's common to not need
scores and therefore not need boost bytes (the Lucene omitNorms case).  It's
also common to not need any matching info beyond doc id (the Lucene omitTFAP
case).  But omitTF and omitP aren't common needs, or Lucene would have them by
now, right?

And since they are infrequently used, Huffman-driven naming philosophy
suggests that they should have long, low-value names: OmitPositionsSimilarity,
OmitTFandPositionsSimilarity (or OmitTFAPSimilarity, which would actually be
an accurate abbreviation in this scenario as opposed to the current Lucene
omitTFAP).

In other words, I don't much care what those are named because they aren't
likely to be used except by people who A) have very, very specific use cases
and B) really know what they're doing.

In contrast, I think it's important that we come up with good names for the
doc-id-tf-positions-but-no-boost-bytes (aka omitNorms) and doc-id-only cases.

> >> We get users who are baffled that their phrase queries no longer work
> >> after setting omitTFAP.
> >
> > This is still a weakness of MatchSimilarity.
> 
> Well MatchSimilarity arguably should mean "match all queries
> correctly, just don't score them".  Ie, positional queries should in
> fact work... just not receive a score.

Right.  However, now that I've thought about it, if a user indicates that a
field is "match-only" by supplying a MatchSimilarity, we know that we can
omit boost bytes.  

So we can re-conceive "MatchSimilarity" as being analogous to omitNorms.
Huzzah!

One down, one to go.  :)

> > On the other hand, typical candidates for MatchSimilarity...
> >
> >  * unique_id
> >  * category
> >  * tags
> >
> > ... either won't contain multiple tokens, or won't generally return sensible
> > results for phrase queries.
> 
> Maybe we need to splinter MatchSim into the two cases.  Whether
> positions are stored, and whether scoring is done, is really
> orthogonal.

Maybe "MinimalSimilarity" as the analogue for Lucene omitTFAP?  I dunno,
that might be kind of generic, but maybe it makes sense in context.

The idea is to get the user to describe how the field will be scored.  Based on
that info, we can customize the posting format, possibly making optimizations
and omitting certain posting data.  
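
As a rough illustration of format-follows-Similarity, here is a sketch in
which the posting format is derived from the Similarity the user puts in the
FieldType.  Everything in it is an assumption made for illustration: the
PostingFormat record, the posting_format() method, and the exact data each
class keeps (MatchSimilarity as the omitNorms analogue, MinimalSimilarity as
doc-id-only) follow the mapping proposed in this thread, not any shipped API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostingFormat:
    """Which per-posting data gets written; doc ids are always stored."""
    term_freq: bool
    positions: bool
    boost_bytes: bool


class Similarity:
    """Stand-in for a full scoring Similarity: keeps everything."""
    def posting_format(self) -> PostingFormat:
        return PostingFormat(term_freq=True, positions=True, boost_bytes=True)


class MatchSimilarity(Similarity):
    """Match-only: positional queries still work, but no boost bytes."""
    def posting_format(self) -> PostingFormat:
        return PostingFormat(term_freq=True, positions=True, boost_bytes=False)


class MinimalSimilarity(Similarity):
    """Doc ids only, in the spirit of Lucene's omitTFAP."""
    def posting_format(self) -> PostingFormat:
        return PostingFormat(term_freq=False, positions=False, boost_bytes=False)


@dataclass
class FieldType:
    """The user describes scoring; the posting format follows from it."""
    similarity: Similarity

    def posting_format(self) -> PostingFormat:
        return self.similarity.posting_format()
```

The point of the design is that the user never names a posting format
directly: FieldType(MinimalSimilarity()) is a statement about how the field
will be scored, and the smaller format falls out of it.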

When people ask on the user list...

    "How can I make my index smaller?"
   
... we can reply like so:

    "Make some fields match-only by specifying MatchSimilarity in the
    FieldType, or even better if you don't need phrase queries, by specifying
    MinimalSimilarity.  You'll be throwing away data Lucy needs for
    sophisticated queries, but your index will get smaller."

I think that response is easier to understand than a response instructing them
to "enable omitNorms", and it introduces the very important Similarity class
rather than the confusing, overloaded, and not-very-useful terminology,
"norms".

> >> > They could use better codecs under the format-follows-Similarity model, too.
> >> > They'd just have to subclass and override the factory methods that spawn
> >> > posting encoders/decoders.
> >>
> >> Ahh, OK so that's how they'd do it.
> >>
> >> So... I think we're making a mountain out of a molehill.
> >
> > Well, I don't see it that way, because I place great value on designing
> > good public APIs, and I think it's important that we avoid forcing users to
> > know about codecs.
> 
> I had thought we were bickering about whether you subclass & override
> a method (to alter the codec) (= Lucy) vs you create your own
> Codec/CodecProvider and pass that to your writer, which seems..... a
> minor difference.
> 
> If the user is not tweaking the codec, they don't have to do anything
> with codecs (the defaults work) for either Lucy or Lucene.
> 
> So the only difference is the specifics of how the codec-tweaking-user
> in fact alters the codec.

I don't think that's the only difference.  What does the novice user know
about "PFOR", about "pulsing", about "group varint", etc?  They aren't
going to know jack.  So how are you expecting them to distinguish between
various Codec subclasses named after those high-falutin' concepts?

The difference is that you're forcing the novice user to learn esoteric
material just to get started, while the format-follows-sim model is trying to
spare the novice yet enable the expert.  Users shouldn't have to distinguish
between "codecs" until they are actually ready to write their own.  
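
To make the "subclass and override the factory method" route concrete, here
is a small sketch.  The class and method names (PostingEncoder,
FixedWidthEncoder, make_posting_encoder, MyTunedSimilarity) are hypothetical,
and the two encodings are deliberately simple stand-ins (variable-length
bytes for doc-id deltas, and raw 4-byte ints), not PFOR, pulsing or group
varint.

```python
import struct
from typing import Iterable


class PostingEncoder:
    """Default encoder: doc-id deltas as variable-length bytes, 7 bits each."""
    def encode(self, doc_ids: Iterable[int]) -> bytes:
        out = bytearray()
        prev = 0
        for doc_id in doc_ids:  # doc ids are assumed to be ascending
            delta = doc_id - prev
            prev = doc_id
            while delta >= 0x80:
                # Emit the low 7 bits with the continuation flag set.
                out.append((delta & 0x7F) | 0x80)
                delta >>= 7
            out.append(delta)
        return bytes(out)


class FixedWidthEncoder(PostingEncoder):
    """Alternative an expert might plug in: raw 4-byte big-endian doc ids."""
    def encode(self, doc_ids: Iterable[int]) -> bytes:
        return b"".join(struct.pack(">I", d) for d in doc_ids)


class Similarity:
    def make_posting_encoder(self) -> PostingEncoder:
        # Novices get this default without ever choosing an encoding.
        return PostingEncoder()


class MyTunedSimilarity(Similarity):
    def make_posting_encoder(self) -> PostingEncoder:
        # Expert override of the factory method.
        return FixedWidthEncoder()
```

A novice who only ever instantiates Similarity never sees an encoder class;
only the author of MyTunedSimilarity has to know that more than one encoding
exists.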

As we discussed on IRC yesterday, the number of people who will be qualified
to write posting codec code will still be very small, even after we finish
this democratization push.  It will be a big step forward if we can just get
more Lucene committers to grok the inner workings of posting lists.  

However, there are some very useful optimizations that will be underutilized
by the user base if the public API uses jargon like "omitTFAP" and "PFORCodec"
that shuts out everyone except elite developers.

> > Under format-follows-Sim, it would be the Similarity object that knows all
> > supported decoding configurations for the field.
> 
> I'm still hazy on how you'll know at search time which Sims are
> "congruent" with what's stored in the index.... ie that downgrading to
> MatchOnlySim is allowed, but swapping to a different scoring model is
> not (because norms are committed at indexing time).

I'm not sure that e.g. TermScorer would even know what Similarity it was
dealing with.  It would ask for a boost-byte decoder from the sim, but it
wouldn't have to know or care how the boost bytes got translated to float
multipliers.  

Under Lucy, you can't switch to a different weighting model at search time
because the boost bytes are baked into the index.  But you can still do
doc-id-only posting iteration against any posting format since doc-id-only is
the minimum requirement for a posting list.

So your question is predicated on the assumption that you need a
doc-id-only Similarity to do doc-id-only postings iteration, but that's not
true -- you need a doc-id-only PostingDecoder, which may be spawned by any
Similarity.  
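
A sketch of that separation, with all names assumed for illustration
(doc_id_decoder, boost_decoder, LinearBoostSimilarity, score_term): the
scorer asks the Similarity for a boost decoder and never inspects which
Similarity it was handed, and doc-id-only iteration is available from any
Similarity.  The byte-to-float mapping and the tf * boost score are made up
for the example and are not a real scoring formula.

```python
from typing import Callable, Iterator, List, Tuple

# A hypothetical stored posting: (doc_id, term_freq, boost_byte).
Posting = Tuple[int, int, int]


class Similarity:
    def doc_id_decoder(self) -> Callable[[List[Posting]], Iterator[int]]:
        # Works against any posting format, since every posting has a doc id.
        return lambda postings: (p[0] for p in postings)

    def boost_decoder(self) -> Callable[[int], float]:
        raise NotImplementedError


class LinearBoostSimilarity(Similarity):
    """Made-up scheme: boost byte b maps to the multiplier b / 255."""
    def boost_decoder(self) -> Callable[[int], float]:
        return lambda b: b / 255.0


def score_term(sim: Similarity, postings: List[Posting]) -> List[Tuple[int, float]]:
    """Scorer that only uses the decoder the Similarity hands back."""
    decode = sim.boost_decoder()
    return [(doc, tf * decode(boost)) for doc, tf, boost in postings]
```

Because the boost bytes are already in the index, swapping in a Similarity
with a different boost_decoder() would reinterpret them rather than recompute
them, which is why only the downgrade to doc-id-only iteration is always safe.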

Does that make sense?

Marvin Humphrey


