On Sun, Mar 7, 2010 at 1:21 PM, Marvin Humphrey <mar...@rectangular.com> wrote:

> On Sat, Mar 06, 2010 at 05:07:18AM -0500, Michael McCandless wrote:
>> It won't encounter an unknown posting format. It's the codec. It
>> knows all posting formats by the time it sees it.
>
> OK, so you're not going to handle this the way Lucene handles field types and
> accept a new codec spec reference with each field in each Document.
Right.

> There will be per-index associations between field names and codecs
> and it will be invalid to change those associations.

Well, per-segment.  So different IndexWriter sessions could use a
different Codec for the segments they write.  Every time a segment
needs to be written (during flush or during merge), Lucene calls
Codecs.getWriter to get the Codec to use for writing that one segment.

And it's not per-field; it's for all fields.  But we have a
PerFieldCodecWrapper (currently in a unit test, but I think we should
promote it).

I do agree it'd be great to eventually consolidate all this field
configuration in Lucene... and not have any more PerFieldThisWrapper
and PerFieldThatWrapper...

I think we can actually do so without losing Lucene's loose typing if
we simply peeled out [say] a FieldType class that holds the settings
you now set on each field (omitTFAP, omitNorms, TermVector, Store,
Index), and each Field instance holds a ref to its FieldType.  We
could then store Analyzer and Codec on there, too.

Lucene would still be "loosely typed" (ie, no global schema) in that
every time you index new docs you're free to make up a new FieldType
instance (ie it wouldn't be stored in the index -- it's "stored" in
your app's Java sources), though probably FieldType itself would be
write-once during an IndexWriter session.

Hmm, big change though -- I don't want to gate landing flex with this.

>> Well, Codec is intentionally generic -- currently it "only" serves up
>> readers & writers for postings, but over time I expect it'll
>> be the class Lucene uses to get reader/writer for other parts of the
>> index.
>
> Huh? What does the posting format specifier have to do with e.g. stored
> fields?
>
> What you're describing sounds more like the Architecture class in KinoSearch.

OK.

>> I'm a little confused: if I indexed a field with full postings data,
>> shouldn't I still be allowed to score with match-only scoring?
>
> Of course.
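The FieldType consolidation being floated above might look roughly like
this sketch.  To be clear, this is not the flex API -- the class shape,
the getter names, and the Field stand-in are all assumptions for
illustration:

```java
// Hypothetical sketch of the proposed FieldType class -- not actual
// Lucene API.  It gathers the per-field settings currently set on each
// Field (omitTFAP, omitNorms, Store, Index); per the proposal,
// Analyzer and Codec refs could live here too.
class FieldType {
    private final boolean omitTermFreqAndPositions; // "omitTFAP"
    private final boolean omitNorms;
    private final boolean stored;
    private final boolean indexed;

    FieldType(boolean omitTFAP, boolean omitNorms,
              boolean stored, boolean indexed) {
        this.omitTermFreqAndPositions = omitTFAP;
        this.omitNorms = omitNorms;
        this.stored = stored;
        this.indexed = indexed;
    }

    boolean omitTermFreqAndPositions() { return omitTermFreqAndPositions; }
    boolean omitNorms() { return omitNorms; }
    boolean isStored() { return stored; }
    boolean isIndexed() { return indexed; }
}

// Each Field instance would then just hold a ref to its shared,
// write-once FieldType, instead of carrying the settings itself:
class Field {
    final String name;
    final String value;
    final FieldType type;

    Field(String name, String value, FieldType type) {
        this.name = name;
        this.value = value;
        this.type = type;
    }
}
```

Since the FieldType lives only in the app's Java sources (not in the
index), the loose typing is preserved: nothing stops a later indexing
session from making up a new FieldType for the same field name.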
>
>> When a movie is encoded to a file, the codec(s) determine all sorts of
>> interesting details. Then when you watch the movie you're free to do
>> whatever you want -- watch as hidef, as normal def, cropped, sound
>> only, listen to different languages, pick subtitles, etc. How it's
>> specifically encoded is strongly decoupled from how you use it.
>
> I see what you're getting at. However, Similarity *already* affects the
> contents of the index, via encodeNorm()/decodeNorm() and lengthNorm(). So if
> you want to divorce Similarity from index format, you'll need to remove those
> methods.

This brings us full circle -- it's exactly what I'd like to do as the
baby step ;)  Ie, lengthNorm would no longer be publicly used (since,
instead, the true stats are written to the index).  (Privately, within
Sim impls, it'd presumably still be used.)

encode/decodeNorm would also be private to the Sim impl -- that's just
a way to quantize a float into a single byte, to save RAM.  Other Sim
impls may instead want to store a float directly, use 2 bytes to
quantize floats, use only 4 bits per norm, store nothing at all (match
only), etc.

> In my opinion, it makes more sense to go the opposite direction, and have
> Similarity objects spawn PostingEncoder objects which define the index format.
> The ability of a search-time Similarity object to make relevance judgements
> and assign scores is intimately tied to the information prepared for it in
> advance and written at index-time.

I'm still not quite seeing so strong a connection... Yes, Sim needs
many "facts" to use for its decision making (length-in-tokens for each
docXfield, avg(tf) for each docXfield, docFreq(term), etc.), but how
those facts are encoded seems orthogonal.

I do agree there's some connection -- if I don't store tf nor
positions, then I can't use a Sim that needs those stats.
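The quantization being described -- squeezing a float norm into a
single byte to save RAM -- can be sketched as below.  This uses a
simple linear scheme for clarity; it is an assumption for
illustration, not Lucene's actual SmallFloat-based
encodeNorm/decodeNorm:

```java
// Illustrative one-byte norm quantization, kept private to a Sim impl
// as proposed above.  NOT Lucene's real encoding -- a plain linear
// scheme, just to show the precision/RAM trade-off.
final class ByteNormQuantizer {
    // Clamp the norm to [0, 1] and spread it over 256 levels.
    static byte encodeNorm(float norm) {
        if (norm < 0f) norm = 0f;
        if (norm > 1f) norm = 1f;
        return (byte) Math.round(norm * 255f);
    }

    // Undo the two's-complement sign before scaling back to a float.
    static float decodeNorm(byte b) {
        return (b & 0xFF) / 255f;
    }
}
```

A different Sim impl could make a different trade: 2 bytes gives 65536
levels, 4 bits gives 16, and a match-only Sim stores nothing at all.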
> I also like the idea of novice/intermediate users being able to express the
> intent for how a field gets scored by choosing a Similarity subclass, without
> having to worry about the underlying details of posting format.

Well... I think the standard codec in Lucene will store these 2 common
stats (field length, avg(tf)), and then we'd provide various Sim
impls?  So with the default codec, a user can still pick the Sim impl
that does the scoring they want?  If the user switches up their codec,
then they'll need to ensure it also stores the stats required by their
Sim(s).

>> > What's the flex API for specifying a custom posting format?
>>
>> You implement a Codecs class, which within it knows about any number
>> of Codec impls that it can retrieve by name.
>
> So you have both a class named "Codec" and a class named "Codecs"? :(
>
> Tell me, is this an array of Codecs or a Codecs?
>
>     return codecs;

Probably a Codecs instance ;)

Yeah, it's not ideal... maybe rename Codecs -> CodecProvider?
CodecFactory?  Codecs' purpose is to 1) provide the Codec that'll
write a new segment, and 2) look up codecs by String name (when
reading previously written segments).

>> Here's the default Codecs on flex now:
>>
>>   class DefaultCodecs extends Codecs {
>>     DefaultCodecs() {
>>       register(new StandardCodec());
>>       register(new IntBlockCodec());
>>       register(new PreFlexCodec());
>>       register(new PulsingCodec());
>>       register(new SepCodec());
>>     }
>>
>>     @Override
>>     public Codec getWriter(SegmentWriteState state) {
>>       return lookup("Standard");
>>       //return lookup("Pulsing");
>>       //return lookup("Sep");
>>       //return lookup("IntBlock");
>>     }
>>   }
>>
>> getWriter returns the Codec that will write the current segment.
>
> So...
>
> * The user needs to know about SegmentWriteState?

Well, the codec dev does.  A "user" (even one who wants to try out
different codecs others have written) doesn't.

> * The "codec" is per-index, not per-field? Presumably this will change?

Per-index (see above).
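Stripped of the actual index I/O, the register/lookup pattern in that
DefaultCodecs snippet boils down to a name-keyed registry.  Here is a
simplified, self-contained sketch -- the real flex classes carry the
reader/writer machinery and getWriter takes a SegmentWriteState, which
this omits:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins to show the Codecs-as-provider pattern
// discussed above; not the real flex classes.
abstract class Codec {
    final String name;
    Codec(String name) { this.name = name; }
}

class Codecs {
    private final Map<String, Codec> byName = new HashMap<>();

    public void register(Codec codec) {
        byName.put(codec.name, codec);
    }

    // Called when opening a previously written segment: resolve the
    // codec name that was recorded for that segment.
    public Codec lookup(String name) {
        Codec codec = byName.get(name);
        if (codec == null) {
            throw new IllegalArgumentException("unknown codec: " + name);
        }
        return codec;
    }

    // Called at flush/merge time to pick the writer for a new segment.
    public Codec getWriter() {
        return lookup("Standard");
    }
}
```

This also shows why the naming grates: a Codecs is a provider/registry,
not a collection, which is the argument for renaming it CodecProvider.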
> * The "codec" is a writer in this case, but since the name "codec" implies
>   both coding and decoding, it must also be capable of functioning as a
>   reader?

Codec has fieldsConsumer (write a segment) & fieldsProducer (read a
segment) methods.

Codecs has a Codec lookup(String) method, which retrieves the named
codec.

>> > Right. But what happens when you want a custom codec to use BM25 weighting
>> > *and* inline a part-of-speech ID *and* use PFOR?
>>
>> You'd use the PForCodec, and make an attr that injects POS.
>
> OK.
>
> I don't think we're likely to do things that way in Lucy. The functions which
> decode postings will operate directly on raw mmap'd memory, and they typically
> won't make any external calls to either methods or non-inline functions.

Yeah, this is also an option in flex (bake all the attrs you want into
a custom, specialized codec).  But I think once attrs can
serialize/deserialize, any codec (at least our core codecs) should put
foreign attrs into the postings.

In fact, you could argue that what the standard codec does today
(encoding doc/freq/pos/payload) has already "baked in" attrs that you
could have done separately as true attrs.

> If you wanted to use an esoteric custom format, you'd write your own decoder
> function. There won't be a lot of code reuse at this inner-loop level --
> unrolling will be the rule rather than the exception.

Yes... different rules apply "down low".

>> > I think we have to supply a class object or class name when asking for the
>> > enumerator, like you do with AttributeSource.
>> >
>> >     PostingList plist = null;
>> >     PostingListReader pListReader = segReader.fetch(PostingListReader);
>> >     if (pListReader != null) {
>> >         PostingsReader pReader = pListReader.fetch(field);
>> >         if (pReader != null) {
>> >             plist = pReader.makePostingList(klass); // e.g. PartOfSpeechPostingList
>> >         }
>> >     }
>>
>> But is plist a "normal" postings iterator (ie, subclasses it) that has
>> also exposed a dedicated POS API?
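The "attr that injects POS" idea, once attrs can serialize themselves,
might look something like this sketch.  The attribute name, the tag
set, and the serialize/deserialize hooks are all assumptions -- as
noted, the actual serialization API was still being hashed out in
LUCENE-2125:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical self-serializing attribute carrying a part-of-speech
// ID per position.  A codec that supports foreign attrs would call
// serialize() while writing each position into the postings, and
// deserialize() while stepping through positions at search time.
class PartOfSpeechAttribute {
    static final byte NOUN = 0, VERB = 1, ADJ = 2; // example tag set

    private byte posId;

    void setPartOfSpeech(byte id) { posId = id; }
    byte getPartOfSpeech() { return posId; }

    // One byte per position appended to the postings.
    void serialize(DataOutput out) throws IOException {
        out.writeByte(posId);
    }

    void deserialize(DataInput in) throws IOException {
        posId = in.readByte();
    }
}
```

Because the attribute owns its own wire format, any codec (PFOR,
pulsing, whatever) could carry it without knowing what a
part-of-speech tag is -- which is the sense in which the standard
codec's doc/freq/pos/payload encoding is just a set of pre-baked
attrs.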
>
> It's definitely a "normal" postings iterator. As to whether we expose the
> part-of-speech via an attribute or via a method, that's up in the air.

Hmm.

> From a class-design perspective, it would probably be best to go with an
> attribute, since Lucy has only single-inheritance and no interfaces. A rigid
> class hierarchy is going to cause problems when you need an iterator that
> combines unrelated concepts like BM25 weighting and part-of-speech tagging.

OK.

>> In flex you'd get a "normal" DocsAndPositionsEnum, pull the POS attr
>> up front, and as you're next'ing your way through it, optionally look
>> up the POS of each position you step through, using the POS attr.
>
> Just a thought: why not make positions an attribute on a DocsEnum?

Maybe... though I think the double method call (enum.next() then
posAttr.get()) is too much added cost.

I do think we should remove payloads from the API.  Attrs should
simply serialize themselves... but we're still hashing out just how
serialization should work (LUCENE-2125)...

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org