Re: Baby steps towards making Lucene's scoring more flexible...

Michael McCandless Mon, 15 Mar 2010 03:29:04 -0700

On Mon, Mar 15, 2010 at 12:03 AM, Marvin Humphrey
<[email protected]> wrote:
> On Sat, Mar 13, 2010 at 06:41:26AM -0500, Michael McCandless wrote:
>
>> I still don't think similarity should have any bearing during indexing.
>
> Similarity has always, from day one, affected the contents of the index.  This
> idea that it should be totally divorced from indexing is, in fact, a very
> significant change that you are proposing for Lucene, and it will require
> non-trivial changes to the file format.


I agree.  Instead of storing a byte per doc I'm proposing storing the
raw stats and letting Sim compute that byte at search time.  We can
also allow that Sim to cache stuff (boost bytes, if it uses them) to
make startup faster, eventually.
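
A minimal sketch of that idea, assuming hypothetical names throughout
(RawFieldStats, SearchTimeSim and the linear byte encoding are
illustrations only, not Lucene or Lucy APIs): the index keeps raw
per-doc stats, and the Sim derives and caches the norm byte at search
time.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical raw per-document stats kept in the index in place of a norm byte.
final class RawFieldStats {
    final int fieldLength; // number of tokens indexed for this field
    final float boost;     // combined doc boost and field boost

    RawFieldStats(int fieldLength, float boost) {
        this.fieldLength = fieldLength;
        this.boost = boost;
    }
}

// Hypothetical search-time Sim: derives a norm from raw stats, then caches the byte.
class SearchTimeSim {
    private final Map<Integer, Byte> cache = new HashMap<>();

    // Length normalization: boost / sqrt(fieldLength).
    float computeNorm(RawFieldStats stats) {
        return (float) (stats.boost / Math.sqrt(stats.fieldLength));
    }

    // Simple linear quantization of [0, 1] into one byte (assumption: this is
    // NOT Lucene's encodeNorm format, just the simplest lossy encoding).
    byte encode(float norm) {
        float clamped = Math.max(0f, Math.min(1f, norm));
        return (byte) Math.round(clamped * 255f);
    }

    float decode(byte b) {
        return (b & 0xFF) / 255f;
    }

    // Compute the byte on first use and keep it in memory afterwards.
    byte normByte(int docId, RawFieldStats stats) {
        Byte cached = cache.get(docId);
        if (cached == null) {
            cached = encode(computeNorm(stats));
            cache.put(docId, cached);
        }
        return cached;
    }
}
```

A different Sim could read the same RawFieldStats and apply another
formula, which is the point of keeping the raw stats on disk.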

> For starters, you're going to at least double the footprint of the norms.  For
> fields with more than 127 tokens or 127 unique terms, the increase will be
> greater... and if the user sets doc-boost and field-boost in a pattern that
> defies RLE compression, the footprint will be greater still.

On disk, yes.  In memory, no (assuming your Sim impl encodes boost as byte).

> I happen to think that limited search-time settability of Similarity offers a
> nice feature -- the ability to futz with different weighting models and length
> normalization settings without reindexing -- and that it's worth exploring in
> pursuit of this feature.
>
> But by opting to forego the lossy compression now performed by encodeNorm() at
> index-time and store precursor statistics instead, we are going to take a hit
> on index size even with lossless compression.

I think it's worth letting the custom Sim cache stuff [privately] on
disk, ie the byte norms, eventually.

> Furthermore, delaying Similarity choice means that it becomes the user's
> responsibility to ensure that index-time Codec choice is compatible with
> search-time Similarity choice.  In contrast, setting Similarity at index-time
> means that the core gets to pick the Codec and can ensure that all the
> necessary data gets encoded, sparing the user from having to understand the
> gory details of posting formats.

Yeah this is the part I struggle with -- how to make index-time field
options "intelligible".  But I think good defaulting does 90% of the
work.  The remaining 10% can work backwards from their search needs to
what must be done at indexing.

> In summary, I think search-time setting of Similarity is a nice feature but a
> poor requirement.  I'm not persuaded that this proposal to banish Similarity
> from index-time is wise.

OK I think we just differ...

>> But I don't like baking in search concepts at index time...
>
> Then you ought to use a traditional RDBMS rather than an indexing engine, and
> make sure you don't put indexes on any of the fields in your tables.  :)
>
> Or maybe an RDBMS has too many search concepts baked in, and a flat file would
> be best.  :)
>
> Seriously... optimizing on-disk data structures to accommodate anticipated
> search query patterns and maximize speed and relevance... that's what
> indexing's all about, ain't it?

You're reading too much into what I said.

I mean specifically one should not have to commit to the precise
scoring model they will use for a given field, when they index that
field.

Many scoring models are possible if you store enough stats in the
index.

> And what class other than Similarity knows enough about the scoring algorithm
> to perform these data reduction tasks?  If it's not going to be Similarity
> itself, it has to be something that knows absolutely everything about the
> Similarity implementation's scoring model.

I don't follow this...

It will be Sim that computes norm bytes.

I mean, other classes can go and look @ these stats if they want,
too... users will come up with neat uses over time :)

>> > Right.  However, now that I've thought about it, if a user indicates that a
>> > field is "match-only" by supplying a MatchSimilarity, we know that we can
>> > omit boost bytes.
>> >
>> > So we can re-conceive "MatchSimilarity" as being analogous to omitNorms.
>> > Huzzah!
>> >
>> > One down, one to go.  :)
>>
>> Hmm except shouldn't you allow omitting boost bytes but keeping term
>> freqs?  Ie all docs are roughly the same length (say, a title field)
>> and I never boost them?  How will you allow this?
>
> I think that you've described an uncommon use case, and it's tempting to just
> wave it off with the easy answer: you spec a Sim that writes such a format.

I don't think this is so uncommon?  (This is the omitNorms case in
Lucene today, except you still gotta index positions, until we decouple
the two = LUCENE-2048.  Such a nice round binary number for
remembering...).

> But here's where maybe Lucy can steal from the Lucene flex branch.

Yay: poaching!

> We can give Similarity a makePostingCodec() factory method.  Then,
> we can publish common PostingCodecs as public classes, allowing us
> to support different formats with minimal effort.
>
>  class MySim extends Similarity {
>    public PostingCodec makePostingCodec() {
>      StandardPostingCodec codec = new StandardPostingCodec();
>      codec.setOmitBoostBytes(true);
>      codec.setOmitPositions(true);
>      return (PostingCodec)codec;
>    }
>  }

This still feels like you are mixing two very different concepts --
what's being written (boost bytes, positions, docTermFreqs) vs how it's
encoded (codec).  Shouldn't Lucy's schema record what stats should be
indexed for the field?  Then, any codec you swap in should respect
that?  EG maybe I use PForCodec instead, or a PulsingCodec(PForCodec)?
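
A rough sketch of that separation, with every name hypothetical
(IndexedStat, Encoding and FieldSpec are made up for illustration and
are not Lucy or Lucene classes): the schema entry records which stats
are written, and the encoding can be swapped without touching them.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

// What gets written for a field (hypothetical names).
enum IndexedStat { DOC_TERM_FREQS, POSITIONS, BOOST_BYTES }

// How postings are encoded on disk; only a label here, no real encoding.
enum Encoding { STANDARD, PFOR, PULSING_PFOR }

// Hypothetical per-field schema entry: stats and encoding are independent.
final class FieldSpec {
    final String name;
    final Set<IndexedStat> stats;
    final Encoding encoding;

    FieldSpec(String name, Set<IndexedStat> stats, Encoding encoding) {
        this.name = name;
        // Defensive copy; EnumSet.copyOf rejects empty non-EnumSet input,
        // so the empty case is handled explicitly.
        EnumSet<IndexedStat> copy = stats.isEmpty()
                ? EnumSet.noneOf(IndexedStat.class)
                : EnumSet.copyOf(stats);
        this.stats = Collections.unmodifiableSet(copy);
        this.encoding = encoding;
    }

    // Swapping the encoding leaves the set of indexed stats untouched.
    FieldSpec withEncoding(Encoding other) {
        return new FieldSpec(name, stats, other);
    }
}
```

For example, a title field declared with only DOC_TERM_FREQS keeps
exactly that set whether it is written with STANDARD or PFOR.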

> (FWIW, you could theoretically do something similar with Lucene: supply one
> Sim at index time, but write precursors instead of boost bytes and allow a
> different Sim to be used at search-time.)

Except Sim shouldn't be used at indexing ;)

> This setup follows the easy-things-easy-hard-things-possible model, because
> the user doesn't have to know posting formats intimately to start optimizing
> away needless data, but experts like Earwin get the direct access they
> seemingly can't live without.

Yeah Earwin seems a good exemplar for the experts camp ;)

I'm thinking the various Sim classes, which you'd select during
searching, will note in jdocs what attrs must be indexed.  It's your
job to read that and set your field (schema) up accordingly, ie,
enable those required attrs. The Sim class will also check @ search
time and throw an exception if the precursors they require were not
indexed, with a clear message saying "field X is missing attr Y".
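
A minimal sketch of that search-time check, assuming hypothetical
names (Attr, CheckedSim, TfLengthSim and this MatchOnlySim are
illustrations only, not existing Lucene classes): each Sim declares
the attrs it needs and fails with a clear message if the field lacks
one.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical per-field index attributes.
enum Attr { TERM_FREQS, POSITIONS, FIELD_LENGTH, BOOST }

// Hypothetical base class: each Sim declares the attrs it needs at search time.
abstract class CheckedSim {
    abstract Set<Attr> requiredAttrs();

    // Throws if the field was indexed without an attr this Sim requires.
    void checkField(String field, Set<Attr> indexedAttrs) {
        for (Attr attr : requiredAttrs()) {
            if (!indexedAttrs.contains(attr)) {
                throw new IllegalStateException(
                    "field " + field + " is missing attr " + attr);
            }
        }
    }
}

// Needs nothing beyond doc IDs, so it works against any field.
class MatchOnlySim extends CheckedSim {
    @Override
    Set<Attr> requiredAttrs() {
        return EnumSet.noneOf(Attr.class);
    }
}

// Needs term freqs and field length to score.
class TfLengthSim extends CheckedSim {
    @Override
    Set<Attr> requiredAttrs() {
        return EnumSet.of(Attr.TERM_FREQS, Attr.FIELD_LENGTH);
    }
}
```

Under this sketch a match-only Sim passes the check on any field,
while a scoring Sim fails fast instead of silently scoring with
missing data.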

>> I agree it's not great to have to speak/think in low level indexing
>> attr concepts... because it forces user to translate to what that
>> means at search time.  But I still don't see a great alternative.  I
>> don't like pushing the Sim choice all the way back into indexing.
>
> You make it sound like the way things have been done since forever is some
> radical experiment. :P
>
> Similarity choice IS made at index time.  Nobody's pushing it back -- you're
> proposing that it be pushed forward.

Right, today Similarity does impact indexing, and it's been actually a
source of surprise.

>> > Under Lucy, you can't switch to a different weighting model at search time
>> > because the boost bytes are baked into the index.  But you can still do
>> > doc-id-only posting iteration against any posting format since doc-id-only
>> > is the minimum requirement for a posting list.
>> >
>> > So your question is predicated on the assumption that you need a
>> > doc-id-only Similarity to do doc-id-only postings iteration, but that's not
>> > true -- you need a doc-id-only PostingDecoder, which may be spawned by any
>> > Similarity.
>> >
>> > Does that make sense?
>>
>> It sounds like... if the user had used AllBellsAndWhistlesScoringSim
>> while indexing, they will still be able to use MatchOnlySim while
>> searching because under-the-hood MatchOnlySim knows how to pull a
>> docID only postings iterator from that field.
>
> You seem to be fixated on the notion of swapping in a MatchOnlySim object at
> search time.  You can't do that in KS/Lucy, because you can't modify a Schema
> at search-time, and the per-field Similarity assignments are part of the
> Schema.  But *it doesn't matter* because you don't need a MatchOnlySim to
> do doc-id-only postings iteration -- an AllBellsAndWhistlesScoringSim can
> spawn a doc-id-only PostingDecoder just as easily as MatchOnlySim can.

I am fixated because it's a glaring example (to me) of what's wrong
with forcing user to commit to how scoring is going to happen, at
index time, for that field.

And I'm still confused on how this'll work in Lucy -- if in my global
write-once Lucy schema I bind a field during indexing to
AllBellsAndWhistlesScoringSim... then at search time, sure, it can
spawn a doc-id-only PostingDecoder... so that does mean I can do
match-only searching using that, somehow?  (Ie I can't change the
field to MatchOnlySim, but, I have some workaround that lets me
achieve the same functionality...?).
