Re: Baby steps towards making Lucene's scoring more flexible...

Marvin Humphrey Mon, 08 Mar 2010 18:48:17 -0800

On Mon, Mar 08, 2010 at 01:13:53PM -0500, Michael McCandless wrote:
> I think we can actually do so w/o losing Lucene's loose typing if we
> simply peeled out [say] a FieldType class that holds the settings you
> now set on each field (omitTFAP, omitNorms, TermVector, Store,
> Index), and Field instance holds a ref to its FieldType.  We could
> then store Analyzer and Codec on there, too.


You can use shared FieldType instances to hold typing information without
enforcing consistency.
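
A minimal sketch of that idea follows. All class names here (SketchFieldType, SketchField, FieldTypeDemo) are hypothetical and are not Lucene's API; the settings are a subset of those listed in the quoted proposal.

```java
// Hypothetical sketch: these classes are illustrative, not Lucene's API.
final class SketchFieldType {
    final boolean stored;
    final boolean indexed;
    final boolean omitNorms;
    final boolean omitTFAP; // omit term frequencies and positions

    SketchFieldType(boolean stored, boolean indexed, boolean omitNorms, boolean omitTFAP) {
        this.stored = stored;
        this.indexed = indexed;
        this.omitNorms = omitNorms;
        this.omitTFAP = omitTFAP;
    }
}

final class SketchField {
    final String name;
    final String value;
    final SketchFieldType type; // reference to a shared settings object

    SketchField(String name, String value, SketchFieldType type) {
        this.name = name;
        this.value = value;
        this.type = type;
    }
}

final class FieldTypeDemo {
    // One shared instance per kind of field, living only in application source.
    static final SketchFieldType ID_TYPE = new SketchFieldType(true, true, true, true);
    static final SketchFieldType BODY_TYPE = new SketchFieldType(false, true, false, false);

    static SketchField[] makeDoc(String id, String body) {
        return new SketchField[] {
            new SketchField("id", id, ID_TYPE),
            new SketchField("body", body, BODY_TYPE),
        };
    }

    public static void main(String[] args) {
        SketchField[] doc1 = makeDoc("1", "hello world");
        SketchField[] doc2 = makeDoc("2", "goodbye world");
        // Both "body" fields point at the same settings object.
        System.out.println(doc1[1].type == doc2[1].type);
        // Nothing enforces consistency: a later "body" field may use another type.
        SketchField odd = new SketchField("body", "no norms here", ID_TYPE);
        System.out.println(odd.type == BODY_TYPE);
    }
}
```

The sharing is purely a convenience in this sketch: the last field in main() reuses the name "body" with a different type, and nothing rejects it.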

> Lucene would still be "loosely typed" (ie, no global schema) in that
> every time you index new docs you're free to make up a new FieldType
> instance (ie it wouldn't be stored in the index -- it's "stored" in
> your app's java sources), though probably FieldType itself would be
> write once during an IndexWriter session.

For what it's worth, that's sort of the way KS used to work: Schema/FieldType
information was stored entirely in source code.  That's changed and now we
serialize the whole schema including all Analyzers, but source-code-only is a
viable approach.

> Hmm big change though -- I don't want to gate landing flex with this.

Perhaps factoring out FieldType from Field can be done on trunk, now?  From a
distance, it looks to be a straightforward subtractive refactoring.

> > I see what you're getting at.  However, Similarity *already* affects the
> > contents of the index, via encodeNorm()/decodeNorm() and lengthNorm().  So
> > if you want to divorce Similarity from index format, you'll need to remove
> > those methods.
> 
> This brings us full circle -- it's exactly what I'd like to do as the
> baby step ;)
>
> Ie, lengthNorm would no longer be publicly used (since, instead, the
> true stats are written to the index).  (Privately, within Sim impls
> it'd presumably still be used).
>
> encode/decodeNorm would also be private to the Sim impl -- that's just
> a way to quantize a float into a single byte, to save RAM.  Other Sim
> impls may just want to store a float directly, use 2 bytes to quantize
> floats, use only 4 bits per norm, don't store anything (match only),
> etc.

OK, I see.  Note that although it would mean writing redundant data, Lucy
could theoretically record the same raw stats.  It's just that Lucene would
generate the derived data structures at search-time, while Lucy would generate
them at index-time and then mmap the files at search-time.

I don't think we'd do that, though -- we'd just accept the lossiness and write
out the derived data -- but preparing per-docXfield boost/norm info involves
approximately the same amount of work no matter how you time-shift it.
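
To make the "same work, time-shifted" point concrete, here is a rough sketch of deriving one-byte norms from raw field lengths. The 1/sqrt(numTerms) length norm is the classic default formula; the linear quantization below is a deliberately simple stand-in and is NOT the encoding Lucene actually uses. The class name NormSketch and its methods are assumptions made for illustration.

```java
// Hypothetical sketch: derive quantized norms from raw per-document field lengths.
final class NormSketch {
    // Classic length normalization: longer fields get smaller norms.
    static float lengthNorm(int numTerms) {
        return numTerms <= 0 ? 1f : (float) (1.0 / Math.sqrt(numTerms));
    }

    // Assumed, simplistic quantizer: map [0, 1] linearly onto 0..255.
    static byte encode(float norm) {
        float clamped = Math.max(0f, Math.min(1f, norm));
        return (byte) Math.round(clamped * 255f);
    }

    // Inverse of encode(), lossy by up to half a quantization step.
    static float decode(byte b) {
        return (b & 0xFF) / 255f;
    }

    // The derivation step: identical whether it runs at index time
    // (and the bytes are written to a file) or at search time (on open).
    static byte[] deriveNorms(int[] fieldLengths) {
        byte[] norms = new byte[fieldLengths.length];
        for (int i = 0; i < fieldLengths.length; i++) {
            norms[i] = encode(lengthNorm(fieldLengths[i]));
        }
        return norms;
    }
}
```

Calling deriveNorms() once per field does the per-docXfield work discussed above; the only question is whether its output or its input is what lands in the index.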

> I do agree there's some connection -- if I don't store tf nor
> positions then I can't use a Sim that needs these stats.
>
> > I also like the idea of novice/intermediate users being able to express the
> > intent for how a field gets scored by choosing a Similarity subclass,
> > without having to worry about the underlying details of posting format.
> 
> Well.. I think standard codec in Lucene will store these 2 common
> stats (field length, avg(tf)), then provide various Sim impls?  So w/
> default codec user can still pick the Sim impl that does the scoring
> they want?

OK, that's actually handy, because it allows people to tweak length
normalization without reindexing and presumably speeds up development.  Of all
the knobs that Similarity gives us, lengthNorm() is far and away the most
important.
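
As a sketch of what "tweak length normalization without reindexing" could look like: if raw field lengths are stored, two different normalization functions can be applied to the same stats. Everything named here (LengthNormSketch, SwapNormDemo, the sample lengths) is hypothetical; the second function uses the BM25-style length factor with b = 0.75.

```java
// Hypothetical sketch: swap length normalization over unchanged raw stats.
interface LengthNormSketch {
    float norm(int fieldLength, float avgFieldLength);
}

final class SwapNormDemo {
    // Made-up raw stats, standing in for per-document field lengths in the index.
    static final int[] FIELD_LENGTHS = {10, 40, 250};

    // Classic 1/sqrt(length); ignores the average.
    static final LengthNormSketch CLASSIC =
        (len, avg) -> (float) (1.0 / Math.sqrt(len));

    // BM25-style length factor: 1 / (1 - b + b * len / avg), with b = 0.75.
    static final LengthNormSketch BM25_STYLE =
        (len, avg) -> 1f / (0.25f + 0.75f * len / avg);

    static float average(int[] lengths) {
        long sum = 0;
        for (int len : lengths) sum += len;
        return (float) sum / lengths.length;
    }

    // Recompute norms from the same stored lengths; no reindexing involved.
    static float[] norms(LengthNormSketch sim) {
        float avg = average(FIELD_LENGTHS);
        float[] out = new float[FIELD_LENGTHS.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = sim.norm(FIELD_LENGTHS[i], avg);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(norms(CLASSIC)));
        System.out.println(java.util.Arrays.toString(norms(BM25_STYLE)));
    }
}
```

The cost is the one noted below: the derived values have to be computed each time rather than read back from the index.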

I guess you're OK with slowing down standard Lucene index opens to achieve
this flexibility, since you're going to burn CPU deriving those boost/norm
stats.  Subtle way of encouraging people to use the NRT API, eh?

I can see why you're so resistant to the idea of tying Similarity to format,
now.  However, I think you've managed to persuade me that it's exactly the
right thing to do from an API standpoint.  :)

Probably our perspectives and priorities diverge because of the fact that in
Lucene, Similarity is index-wide, while in Lucy/KS, it's per-field.  E.g., from
your perspective, match-only indexes would be pretty esoteric, but from my
perspective, match-only fields make perfect sense.

> If user switches up their codec then they'll need to ensure it also
> stores stats required by their Sim(s).

That's backwards, IMO.

The posting format encoding should be an implementation detail.  The general
user should be expressing their intent as far as how they want the field to be
scored, and the posting format should flow from that.  

Whether we use VInt, PFOR, group varint, hand-tuned bit shifting, etc under
the hood to implement BM25, match-only, boost-per-position or whatever
shouldn't be the user's concern.  As time goes on, we should allow ourselves
the flexibility to use new compression techniques to write new segments.
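
One way to picture "the posting format should flow from the scoring intent" is the sketch below. The enum ScoringIntentSketch, its flags, and the layout strings are all invented for illustration; in particular, treating BM25 as not needing positions assumes no phrase queries on that field.

```java
// Hypothetical sketch: the user states an intent; required postings data follows.
enum ScoringIntentSketch {
    MATCH_ONLY(false, false, false),
    BM25(true, false, false), // assumption: no phrase queries on this field
    BOOST_PER_POSITION(true, true, true);

    final boolean needsFreqs;
    final boolean needsPositions;
    final boolean needsPositionBoosts;

    ScoringIntentSketch(boolean needsFreqs, boolean needsPositions, boolean needsPositionBoosts) {
        this.needsFreqs = needsFreqs;
        this.needsPositions = needsPositions;
        this.needsPositionBoosts = needsPositionBoosts;
    }

    // What must be recorded per posting. How it is compressed (VInt, PFOR,
    // group varint, ...) is left entirely to the implementation.
    String postingLayout() {
        StringBuilder sb = new StringBuilder("docs");
        if (needsFreqs) sb.append("+freqs");
        if (needsPositions) sb.append("+positions");
        if (needsPositionBoosts) sb.append("+boosts");
        return sb.toString();
    }
}
```

Nothing in the enum mentions a compression scheme, which is the point: that choice can change from segment to segment without the user's declaration changing.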

> > Just a thought: why not make positions an attribute on a DocsEnum?
> 
> Maybe... though I think the double method call (enum.next() then
> posAttr.get()) is too much added cost.

Why wouldn't it work to have the consumer extract the positions attribute from
the DocsEnum during construction?  There's no difference between calling
enum.nextPosition() and positions.next(), is there?
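
Roughly, the shape I have in mind is sketched below. PositionsSketch, DocsEnumSketch and PositionConsumerSketch are hypothetical stand-ins, not the flex branch's classes, and the in-memory arrays replace real postings. Whether a JIT treats the two call paths identically is not something this sketch demonstrates.

```java
// Hypothetical sketch: fetch the positions attribute once, at construction.
interface PositionsSketch {
    int next();
}

final class DocsEnumSketch {
    private final int[][] positionsByDoc; // stand-in for real postings data
    private final PositionsSketch positions;
    private int doc = -1;
    private int posIndex = 0;

    DocsEnumSketch(int[][] positionsByDoc) {
        this.positionsByDoc = positionsByDoc;
        this.positions = this::nextPosition; // attribute view over the same state
    }

    boolean nextDoc() {
        doc++;
        posIndex = 0;
        return doc < positionsByDoc.length;
    }

    int freq() {
        return positionsByDoc[doc].length;
    }

    int nextPosition() {
        return positionsByDoc[doc][posIndex++];
    }

    PositionsSketch positions() {
        return positions;
    }
}

final class PositionConsumerSketch {
    private final DocsEnumSketch docs;
    private final PositionsSketch positions; // extracted once, not per document

    PositionConsumerSketch(DocsEnumSketch docs) {
        this.docs = docs;
        this.positions = docs.positions();
    }

    // One call per position in the inner loop: positions.next().
    int sumAllPositions() {
        int sum = 0;
        while (docs.nextDoc()) {
            int freq = docs.freq();
            for (int i = 0; i < freq; i++) {
                sum += positions.next();
            }
        }
        return sum;
    }
}
```

The consumer never asks the enum for the attribute inside the loop, so the per-position cost is a single method call either way.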

Marvin Humphrey

