Re: Baby steps towards making Lucene's scoring more flexible...

Michael McCandless Thu, 11 Mar 2010 02:59:32 -0800

On Tue, Mar 9, 2010 at 3:58 PM, Marvin Humphrey <mar...@rectangular.com> wrote:
> On Tue, Mar 09, 2010 at 01:18:12PM -0500, Michael McCandless wrote:
>>
>> >> You said "of course" before but... how in your proposal could one
>> >> store all stats for a given field during indexing, but then sometimes
>> >> use match-only and sometimes full-scoring when querying against that
>> >> field?
>> >
>> > The same way that Lucene knows that sometimes it needs a docs-only-enum and
>> > sometimes it needs a docs-and-positions enum.  Sometimes you need scores,
>> > sometimes you don't.
>>
>> But if user had specified BM25Sim when indexing... can they later just
>> change that to MatchOnlySim at search time?
>
> The user won't be able to modify the Schema by reaching into a FieldType
> object and replacing its Similarity instance.
>
> However, internally, match-only iteration of a posting list would work just
> fine.  I mean, the doc id data is there in one form or another.  Under a field
> spec'd to use MatchSimilarity, the default would be to write only one file,
> holding nothing but delta-encoded doc ids.  Under LuceneSimilarity, freq would
> probably be embedded in the doc id file, but iterating that with match-only
> just means throwing away freq.  Slightly less efficient, but still pretty
> good.


If the field is indexed with omitTFAP then it's just doc ID deltas in
the postings.
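
To make that concrete, here's a rough sketch (not Lucene's actual
postings format; PostingsSketch and its methods are made-up names for
illustration) of doc ID delta encoding, plus match-only iteration over
an interleaved doc+freq stream that simply throws freq away:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Lucene's real postings format.
class PostingsSketch {
    // Encode sorted doc IDs as gaps from the previous doc ID.
    static int[] encodeDeltas(int[] sortedDocIds) {
        int[] deltas = new int[sortedDocIds.length];
        int prev = 0;
        for (int i = 0; i < sortedDocIds.length; i++) {
            deltas[i] = sortedDocIds[i] - prev;
            prev = sortedDocIds[i];
        }
        return deltas;
    }

    // Decode a docs-only stream: every entry is a doc ID delta.
    static List<Integer> decodeDocsOnly(int[] deltas) {
        List<Integer> docs = new ArrayList<>();
        int doc = 0;
        for (int delta : deltas) {
            doc += delta;
            docs.add(doc);
        }
        return docs;
    }

    // Match-only iteration over an interleaved (delta, freq) stream:
    // the freq slot is stepped over and never used.
    static List<Integer> decodeMatchOnly(int[] deltaFreqPairs) {
        List<Integer> docs = new ArrayList<>();
        int doc = 0;
        for (int i = 0; i < deltaFreqPairs.length; i += 2) {
            doc += deltaFreqPairs[i];
            // deltaFreqPairs[i + 1] holds freq; ignored for match-only.
            docs.add(doc);
        }
        return docs;
    }
}
```

Both decoders yield the same doc IDs; the second just does a bit of
wasted work per posting, which is the "slightly less efficient" part.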

> So there would be polymorphism in the decoding phase while we're supplying
> information the Similarity object needs to make its similarity judgments.
> However, that polymorphism would be handled internally -- it wouldn't be the
> responsibility of the user to determine whether a codec supported a particular
> scoring model.

Is that a "yes" (a user can do MatchOnlySim at search time if the field
was indexed with BM25Sim)?

> What Lucy users absolutely wouldn't be able to do is change up BM25 weighting
> to standard Lucene weighting at search time, because we'll be writing
> pre-calculated boost bytes at index time.  Re-indexing will be required.

OK.

> I think that's a nice feature for Lucene to provide, but Lucy will have to
> skip it because of our cheap-searcher requirement.

How will Lucy "know" which switchups (Sim at indexing vs Sim at
searching) are "OK"...

>> > What users will be able to tell us is how they want the field to be used,
>> > and we can use that information to help us optimize.  For example, when a
>> > user declares that they want a field to be "match-only", we know we don't
>> > have to write boost bytes, freq or positions, saving space.
>>
>> Yeah.... so, I don't like that in Lucene you call "Field.setOmitTFAP"
>> instead of saying "Field.matchOnly" (or something).  So I do agree
>> that it'd be better if the API made it clear what the *search* time
>> impact is of using this advanced Field API.
>
> In my opinion, it makes sense to communicate "match only" by way of the
> Similarity object as opposed to a boolean.  I think it's a good way to
> introduce the Similarity class and get people comfortable with it, and I also
> think that it's good to keep stuff out of the FieldType API when we can.

But say we want to also allow storing tf but not positions, because
really the two choices should not be coupled (as they are today with
Lucene's omitTFAP).

So I have omitTF and omitP (only 3 combos are allowed -- must omitP if
you omitTF).
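
Roughly like this, as a sketch only (PostingsOptions is a hypothetical
name, not anything in Lucene's API today):

```java
// Hypothetical sketch; these names are not part of Lucene's API.
class PostingsOptions {
    final boolean omitTF;
    final boolean omitP;

    PostingsOptions(boolean omitTF, boolean omitP) {
        // The one disallowed combo: omitTF without omitP.
        if (omitTF && !omitP) {
            throw new IllegalArgumentException("must omitP if you omitTF");
        }
        this.omitTF = omitTF;
        this.omitP = omitP;
    }

    // Human-readable summary of what ends up in the postings.
    String describe() {
        if (omitTF) return "docs only";
        if (omitP) return "docs + freqs";
        return "docs + freqs + positions";
    }
}
```

That gives exactly the 3 allowed combos, and the 4th fails fast at
construction time instead of silently misbehaving at search time.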

What Sim do you call that at indexing time?

>> We get users who are baffled that their phrase queries no longer work
>> after setting omitTFAP.
>
> This is still a weakness of MatchSimilarity.

Well MatchSimilarity arguably should mean "match all queries
correctly, just don't score them".  Ie, positional queries should in
fact work... just not receive a score.

In Lucene you'd have to index normally (don't set omitTFAP) and then
wrap your query in constant score query.

> The default behavior of the KinoSearch QueryParser, which I expect Lucy to
> follow, is to expand all TermQueries and PhraseQueries out to cover all
> indexed fields.  If we include MatchSimilarity fields in that expansion, we'll
> match terms but not phrases.  Maybe that would be a little hard for users to
> understand -- shouldn't a MatchSimilarity field allow phrases to match without
> contributing to scores?

Right.

> On the other hand, typical candidates for MatchSimilarity...
>
>  * unique_id
>  * category
>  * tags
>
> ... either won't contain multiple tokens, or won't generally return sensible
> results for phrase queries.

Maybe we need to splinter MatchSim into the two cases.  Whether
positions are stored, and whether scoring is done, is really
orthogonal.

>> (Today it silently returns no results... with flex you'll get an exception).
>
> Mmm, tough call.

Yes.

>> > They could use better codecs under the format-follows-Similarity model,
>> > too.  They'd just have to subclass and override the factory methods that
>> > spawn posting encoders/decoders.
>>
>> Ahh, OK so that's how they'd do it.
>>
>> So... I think we're making a mountain out of a molehill.
>
> Well, I don't see it that way, because I place great value on designing
> good public APIs, and I think it's important that we avoid forcing users to
> know about codecs.

I had thought we were bickering about whether you subclass & override
a method (to alter the codec) (= Lucy) vs you create your own
Codec/CodecProvider and pass that to your writer, which seems..... a
minor difference.

If the user is not tweaking the codec, they don't have to do anything
with codecs (the defaults work) for either Lucy or Lucene.

So the only difference is the specifics of how the codec-tweaking-user
in fact alters the codec.

>> In format-follows-Sim, it sounds like that simply means the Sim has a
>> default codec, but you can override it if you want (and it's the Sim
>> that "owns" (has the method for) handing out the Codec you'll use).
>
> Yes.
>
>> Whereas in Lucene the same defaulting will take place.  It's just that
>> Sim won't "own" picking the Codec.
>
> However, *something* down in Lucene besides the codec itself will be
> influencing decoder polymorphism.  If there was only one decoding function,
> you'd always iterate positions.  :)

Right, this is the CodecProvider.  It knows the names of all codecs
used in your index.  If you make a custom codec you'll have to give
the right CodecProvider to IndexReader so it can create the right
decoder on hitting each segment.
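
In sketch form (simplified and with made-up names; this is not the
actual flex-branch Codec/CodecProvider code), the idea is just a
name -> codec registry that the reader consults per segment:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the real flex-branch classes.
interface SketchCodec {
    String name();
}

class SketchCodecProvider {
    private final Map<String, SketchCodec> byName = new HashMap<>();

    // Make a codec available under the name it reports.
    void register(SketchCodec codec) {
        byName.put(codec.name(), codec);
    }

    // Called per segment with the codec name recorded for that segment.
    SketchCodec lookup(String name) {
        SketchCodec codec = byName.get(name);
        if (codec == null) {
            throw new IllegalArgumentException("no codec registered for name: " + name);
        }
        return codec;
    }
}
```

If the reader is handed a provider that never had the custom codec
registered, the lookup fails on the first segment written with it.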

> Under format-follows-Sim, it would be the Similarity object that knows all
> supported decoding configurations for the field.

I'm still hazy on how you'll know at search time which Sims are
"congruent" with what's stored in the index.... ie that downgrading to
MatchOnlySim is allowed, but swapping to a different scoring model is
not (because norms are committed at indexing time).
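
To spell out the rule I have in mind, here's a guess at what such a
check could look like.  This is purely an assumption on my part
(SimCongruence is a made-up name; neither Lucene nor Lucy defines
anything like it):

```java
// Hypothetical sketch; neither Lucene nor Lucy defines this check.
class SimCongruence {
    static final String MATCH_ONLY = "MatchOnlySim";

    // True if a field indexed with indexSim may be searched with searchSim.
    static boolean isCongruent(String indexSim, String searchSim) {
        if (indexSim.equals(searchSim)) {
            return true;   // same model at index and search time
        }
        if (searchSim.equals(MATCH_ONLY)) {
            return true;   // downgrade: stored stats are simply ignored
        }
        return false;      // different scoring model: norms already committed
    }
}
```

So BM25Sim -> MatchOnlySim passes, while BM25Sim -> some other scoring
Sim (or MatchOnlySim -> BM25Sim) does not.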

Mike
