Re: Baby steps towards making Lucene's scoring more flexible...

Michael McCandless Sat, 13 Mar 2010 03:41:57 -0800

On Fri, Mar 12, 2010 at 8:31 PM, Marvin Humphrey <[email protected]> wrote:
> On Thu, Mar 11, 2010 at 05:59:03AM -0500, Michael McCandless wrote:
>> > So there would be polymorphism in the decoding phase while we're supplying
>> > information the Similarity object needs to make its similarity judgments.
>> > However, that polymorphism would be handled internally -- it wouldn't be
>> > the responsibility of the user to determine whether a codec supported a
>> > particular scoring model.
>>
>> Is that "yes" (a user can do MatchOnlySim at search time if the field
>> were indexed with BM25Sim)?
>
> In essence, yes.  Technically, no.
>
> Under the covers, doc-id-only postings iteration probably wouldn't be
> implemented by spawning a doc-id-only Similarity object.  It would probably be
> something more like, ask the Similarity for a PostingDecoder with no extra
> attributes.  And then docID-freq-boost postings iteration might be achieved by
> asking the Similarity for a PostingDecoder with TermFreq and DocBoost
> attributes.


Hmm ok so the Sim impls will expose postings with and w/o these attrs.
So then if the postings can't support TermFreq/Boost attrs, it'll
return some sort of error indicating this field can't support scoring?

>> How will Lucy "know" which switchups (Sim at indexing vs Sim at
>> searching) are "OK"...
>
> I think the theme is that each Similarity class will have a whitelist of
> supported posting iteration configurations.  So long as the requested config
> is in the whitelist, you get an iterator back -- otherwise, you get NULL.
>
> Exactly what form the request specification would take, that's up in the air.
> But it would be an implementation detail for now.  So long as the file format
> supports the data, we can build an iterator that reads it, regardless of
> encoding.

OK.

I think that whitelist is a postings thing, not a sim thing :)  The
index either is or isn't able to provide a postings iterator for the
requested attrs, and that means you can or cannot use the Sims requiring
those attrs.  Forcing the indirection through Sim (where Sim tells you
you cannot pull this particular postings) doesn't seem right...

It seems like we can actually do this quite cleanly if everything were
an attr (or at least referenced by an attr at read time).  Ie I make
an array of attrs and ask the index if it can give me those attrs.

[DocIdAttr] would be requested for match only.

[DocIdAttr,PositionsAttr] would be requested for match only of a
positional query (eg phrase query).

[DocIdAttr,TermDocFreqAttr] would be requested for a scoring
non-positional query.

[DocIdAttr,TermDocFreqAttr,PositionsAttr] would be requested for a
scoring positional query.

And one could stick in their custom attrs, too.

Then, any Sim impl can be created @ search time, and it asks the
reader for whatever attrs it needs.  If it gets NULL back that means
it's a non-starter -- and you throw an exception (or, silently pretend
nothing matched).
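
A rough sketch of the attr request idea, in Python just for brevity.  All
names here (Attr, FieldPostings, request) are made up for illustration;
none of this is Lucene or Lucy API.  It assumes the index knows, per field,
which attrs were stored, and answers a request with None when any requested
attr is missing.

```python
from enum import Enum, auto
from typing import FrozenSet, Iterable, Optional


class Attr(Enum):
    # Hypothetical attribute names mirroring the lists above.
    DOC_ID = auto()
    TERM_DOC_FREQ = auto()
    POSITIONS = auto()


class FieldPostings:
    """Hypothetical stand-in for what one field's postings can supply."""

    def __init__(self, stored: Iterable[Attr]):
        # Doc ids are always present: they are the minimum for a posting list.
        self.stored: FrozenSet[Attr] = frozenset(stored) | {Attr.DOC_ID}

    def request(self, wanted: Iterable[Attr]) -> Optional[FrozenSet[Attr]]:
        """Return the granted attr set, or None if any attr is not stored."""
        wanted_set = frozenset(wanted)
        return wanted_set if wanted_set <= self.stored else None


# A field indexed without positions (doc ids and term freqs only).
title = FieldPostings([Attr.TERM_DOC_FREQ])

match_only = title.request([Attr.DOC_ID])
scoring = title.request([Attr.DOC_ID, Attr.TERM_DOC_FREQ])
phrase = title.request([Attr.DOC_ID, Attr.POSITIONS])  # positions not stored
```

Here the phrase request comes back as None, which is the point where the
caller would throw (or pretend nothing matched).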

>> >> Yeah.... so, I don't like that in Lucene you call "Field.setOmitTFAP"
>> >> instead of saying "Field.matchOnly" (or something).  So I do agree
>> >> that it'd be better if the API made it clear what the *search* time
>> >> impact is of using this advanced Field API.
>> >
>> > In my opinion, it makes sense to communicate "match only" by way of the
>> > Similarity object as opposed to a boolean.  I think it's a good way to
>> > introduce the Similarity class and get people comfortable with it, and I
>> > also think that it's good to keep stuff out of the FieldType API when we
>> > can.
>>
>> But say we want to also allow storing tf but not positions, because
>> really the two choices should not be coupled (as they are today with
>> Lucene's omitTFAP).
>>
>> So I have omitTF and omitP (only 3 combos are allowed -- must omitP if
>> you omitTF).
>>
>> What Sim do you call that at indexing time?
>
> Well, those are pretty esoteric posting formats.  It's common to not need
> scores and therefore not need boost bytes (the Lucene omitNorms case).  It's
> also common to not need any matching info beyond doc id (the Lucene omitTFAP
> case).  But omitTF and omitP aren't common needs, or Lucene would have them by
> now, right?

I think it's a compelling use-case.  Ie, allow for proper scoring
of non-positional queries.
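
To spell out the omitTF/omitP rule quoted above (only three of the four
combinations are legal, since you must omitP if you omitTF), here is a tiny
Python sketch.  The function name and the tuple layout are my own
invention, not an existing API.

```python
from itertools import product


def is_valid(omit_tf: bool, omit_p: bool) -> bool:
    """Rule from the thread: positions must be omitted if term freqs are."""
    return omit_p or not omit_tf


# Enumerate (omit_tf, omit_p) pairs and keep only the legal ones.
allowed = [
    (omit_tf, omit_p)
    for omit_tf, omit_p in product([False, True], repeat=2)
    if is_valid(omit_tf, omit_p)
]
```

The one rejected pair is "omit term freqs but keep positions".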

> And since they are infrequently used, Huffman-driven naming philosophy
> suggests that they should have long, low-value names: OmitPositionsSimilarity,
> OmitTFandPositionsSimilarity (or OmitTFAPSimilarity, which would actually be
> an accurate abbreviation in this scenario as opposed to the current Lucene
> omitTFAP).

Just minus the Similarity part ;) I still don't think similarity
should have any bearing during indexing.

> In other words, I don't much care what those are named because they aren't
> likely to be used except by people who A) have very, very specific use cases
> and B) really know what they're doing.
>
> In contrast, I think it's important that we come up with good names for the
> doc-id-tf-positions-but-no-boost-bytes (aka omitNorms) and doc-id-only cases.

Yes -- simple things should be simple.

I do like the name "boost bytes" more than "norms".

But I don't like baking in search concepts at index time...

>> >> We get users who are baffled that their phrase queries no longer work
>> >> after setting omitTFAP.
>> >
>> > This is still a weakness of MatchSimilarity.
>>
>> Well MatchSimilarity arguably should mean "match all queries
>> correctly, just don't score them".  Ie, positional queries should in
>> fact work... just not receive a score.
>
> Right.  However, now that I've thought about it, if a user indicates that a
> field is "match-only" by supplying a MatchSimilarity, we know that we can
> omit boost bytes.
>
> So we can re-conceive "MatchSimilarity" as being analogous to omitNorms.
> Huzzah!
>
> One down, one to go.  :)

Hmm except shouldn't you allow omitting boost bytes but keeping term
freqs?  Ie all docs are roughly the same length (say, a title field)
and I never boost them?  How will you allow this?

>> > On the other hand, typical candidates for MatchSimilarity...
>> >
>> >  * unique_id
>> >  * category
>> >  * tags
>> >
>> > ... either won't contain multiple tokens, or won't generally return
>> > sensible results for phrase queries.
>>
>> Maybe we need to splinter MatchSim into the two cases.  Whether
>> positions are stored, and whether scoring is done, is really
>> orthogonal.
>
> Maybe "MinimalSimilarity" as the analogue for Lucene omitTFAP?  I dunno,
> that might be kind of generic, but maybe it makes sense in context.
>
> The idea is to get the user to describe how the field will be scored.
> Based on that info, we can customize the posting format, possibly making
> optimizations and omitting certain posting data.

But I don't think the user should describe how the field will be
scored, when they are indexing.  That's too early to commit.

Or.... maybe they provide all possible ways they want the field scored
(ie an array of Sims)?  And we, under the hood, map to all attrs then
required?  Hmmmmm.
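
A sketch of that mapping, assuming each Sim can declare the stats it needs.
The class names and the `required` field are hypothetical, and the attr
names are plain strings only to keep the example short.

```python
from typing import FrozenSet, Iterable


class Sim:
    """Hypothetical Similarity base: declares the stats it needs."""
    required: FrozenSet[str] = frozenset({"doc_id"})


class MatchOnlySim(Sim):
    pass  # doc ids are enough


class TfScoringSim(Sim):
    required = frozenset({"doc_id", "term_doc_freq", "boost_bytes"})


class PhraseMatchSim(Sim):
    required = frozenset({"doc_id", "positions"})


def attrs_to_index(sims: Iterable[Sim]) -> FrozenSet[str]:
    """Union of everything the listed Sims might ask for at search time."""
    needed: FrozenSet[str] = frozenset()
    for sim in sims:
        needed |= sim.required
    return needed


wanted = attrs_to_index([MatchOnlySim(), TfScoringSim()])
```

With this shape the user still names Sims, but the index only ever sees
the resulting set of attrs.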

> When people ask on the user list...
>
>    "How can I make my index smaller?"
>
> ... we can reply like so:
>
>    "Make some fields match-only by specifying MatchSimilarity in the
>    FieldType, or even better if you don't need phrase queries, by specifying
>    MinimalSimilarity.  You'll be throwing away data Lucy needs for
>    sophisticated queries, but your index will get smaller."
>
> I think that response is easier to understand than a response instructing them
> to "enable omitNorms", and it introduces the very important Similarity class
> rather than the confusing, overloaded, and not-very-useful terminology,
> "norms".

I agree it's not great to have to speak/think in low level indexing
attr concepts... because it forces the user to translate to what that
means at search time.  But I still don't see a great alternative.  I
don't like pushing the Sim choice all the way back into indexing.

>> >> > They could use better codecs under the format-follows-Similarity
>> >> > model, too.  They'd just have to subclass and override the factory
>> >> > methods that spawn posting encoders/decoders.
>> >>
>> >> Ahh, OK so that's how they'd do it.
>> >>
>> >> So... I think we're making a mountain out of a molehill.
>> >
>> > Well, I don't see it that way, because I place great value on designing
>> > good public APIs, and I think it's important that we avoid forcing users to
>> > know about codecs.
>>
>> I had thought we were bickering about whether you subclass & override
>> a method (to alter the codec) (= Lucy) vs you create your own
>> Codec/CodecProvider and pass that to your writer, which seems..... a
>> minor difference.
>>
>> If the user is not tweaking the codec, they don't have to do anything
>> with codecs (the defaults work) for either Lucy or Lucene.
>>
>> So the only difference is the specifics of how the codec-tweaking-user
>> in fact alters the codec.
>
> I don't think that's the only difference.  What does the novice user know
> about "PFOR", about "pulsing", about "group varint", etc?  They aren't
> going to know jack.  So how are you expecting them to distinguish between
> various Codec subclasses named after those high-falutin' concepts?

Yeah I see we are talking about something different -- I now take back
the mole hill assertion.

But: the codec is a choice largely orthogonal to what stats (docIDs,
termDocFreq, positions) are recorded in the index.

PFOR, Standard, Pulsing(Standard,2), etc, can all encode all of these
stats...

(Though some may be better than others for certain stats, so I can
imagine picking the codec based on what stats the user requested).
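
To illustrate "same stats, different codec", here is a toy Python sketch
with two made-up codecs (PlainCodec and DeltaCodec, neither of which is
PFOR, Standard or Pulsing).  Both record exactly docID and termDocFreq;
they differ only in how the bytes are laid out.

```python
from typing import List, Tuple

Posting = Tuple[int, int]  # (doc_id, term_doc_freq)


def write_vint(out: bytearray, n: int) -> None:
    """Append n as a variable-length int, 7 bits per byte, low bits first."""
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)


def read_vint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Read one variable-length int; return (value, next position)."""
    value = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if b < 0x80:
            return value, pos
        shift += 7


class PlainCodec:
    """Toy codec: stores absolute doc ids."""

    def encode(self, postings: List[Posting]) -> bytes:
        out = bytearray()
        for doc_id, freq in postings:
            write_vint(out, doc_id)
            write_vint(out, freq)
        return bytes(out)

    def decode(self, buf: bytes) -> List[Posting]:
        postings, pos = [], 0
        while pos < len(buf):
            doc_id, pos = read_vint(buf, pos)
            freq, pos = read_vint(buf, pos)
            postings.append((doc_id, freq))
        return postings


class DeltaCodec:
    """Toy codec: stores each doc id as the gap from the previous one."""

    def encode(self, postings: List[Posting]) -> bytes:
        out, prev = bytearray(), 0
        for doc_id, freq in postings:
            write_vint(out, doc_id - prev)
            write_vint(out, freq)
            prev = doc_id
        return bytes(out)

    def decode(self, buf: bytes) -> List[Posting]:
        postings, pos, doc_id = [], 0, 0
        while pos < len(buf):
            gap, pos = read_vint(buf, pos)
            freq, pos = read_vint(buf, pos)
            doc_id += gap
            postings.append((doc_id, freq))
        return postings


postings = [(1000, 2), (1003, 1), (1010, 5)]
plain_bytes = PlainCodec().encode(postings)
delta_bytes = DeltaCodec().encode(postings)
```

Both byte strings decode back to the same postings; the delta form is just
shorter for this input, which is the kind of thing that might drive the
codec choice once the stats are fixed.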

> The difference is that you're forcing the novice user to learn esoteric
> material just to get started, while the format-follows-sim model is trying to
> spare the novice yet enable the expert.  Users shouldn't have to distinguish
> between "codecs" until they are actually ready to write their own.

Be careful: nobody is forcing the user to learn much just to get started
-- that's what defaults are for.  By default you get full scoring
(boost bytes, termDocFreq) & positions, and avgTF and
fieldLengthInTokens.  There are then no restrictions on what you can
do @ search time.  This will fit 90% of the uses.

For the 10% that want to tweak, yes, they'll need to learn what they
are doing.  But I don't think they should pick Sims for indexing; they
should pick the stats they want.

Then the 1% (probably more like 0.1%) that don't like the builtin
stat choices will roll up their sleeves and make their own
attrs/codecs.

> As we discussed on IRC yesterday, the number of people who will be qualified
> to write posting codec code will still be very small, even after we finish
> this democratization push.  It will be a big step forward if we can just get
> more Lucene committers to grok the inner workings of posting lists.
>
> However, there are some very useful optimizations that will be underutilized
> by the user base if the public API uses jargon like "omitTFAP" and "PFORCodec"
> that shuts out everyone except elite developers.

True...

>> > Under format-follows-Sim, it would be the Similarity object that knows all
>> > supported decoding configurations for the field.
>>
>> I'm still hazy on how you'll know at search time which Sims are
>> "congruent" with what's stored in the index.... ie that downgrading to
>> MatchOnlySim is allowed, but swapping to a different scoring model is
>> not (because norms are committed at indexing time).
>
> I'm not sure that e.g. TermScorer would even know what Similarity it was
> dealing with.  It would ask for a boost-byte decoder from the sim, but it
> wouldn't have to know or care how the boost bytes got translated to float
> multipliers.

Right -- so it wouldn't know boost bytes were used to compress.  It
just calls a method to get the float boost for this doc.
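
Roughly like this Python sketch, where BoostDecoder and its lookup table
are hypothetical.  The linear table is a toy chosen for readability; it is
NOT the encoding Lucene actually uses for norms.

```python
from typing import List


class BoostDecoder:
    """Hypothetical decoder: hides how one byte per doc becomes a float."""

    def __init__(self, boost_bytes: bytes, table: List[float]):
        if len(table) != 256:
            raise ValueError("need one float per possible byte value")
        self._bytes = boost_bytes
        self._table = table

    def boost(self, doc_id: int) -> float:
        # The scorer only ever sees the float, never the byte behind it.
        return self._table[self._bytes[doc_id]]


# Toy linear table (an assumption for this example only).
LINEAR_TABLE = [b / 255.0 for b in range(256)]

# One boost byte per document, indexed by doc id.
decoder = BoostDecoder(bytes([0, 255, 51]), LINEAR_TABLE)
top_boost = decoder.boost(1)
```

Swapping in a different table (or no bytes at all) would not change the
caller, which only asks for the float boost of a doc.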

> Under Lucy, you can't switch to a different weighting model at search time
> because the boost bytes are baked into the index.  But you can still do
> doc-id-only posting iteration against any posting format since doc-id-only is
> the minimum requirement for a posting list.
>
> So your question is predicated on the assumption that you need a
> doc-id-only Similarity to do doc-id-only postings iteration, but that's not
> true -- you need a doc-id-only PostingDecoder, which may be spawned by any
> Similarity.
>
> Does that make sense?

It sounds like... if the user had used AllBellsAndWhistlesScoringSim
while indexing, they will still be able to use MatchOnlySim while
searching because under-the-hood MatchOnlySim knows how to pull a
docID only postings iterator from that field.

Mike

