Re: Baby steps towards making Lucene's scoring more flexible...

Marvin Humphrey Tue, 09 Mar 2010 07:30:36 -0800

On Tue, Mar 09, 2010 at 05:06:08AM -0500, Michael McCandless wrote:
> > For what it's worth, that's sort of the way KS used to work: Schema/FieldType
> > information was stored entirely in source code.  That's changed and now we
> > serialize the whole schema including all Analyzers, but source-code-only is a
> > viable approach.
> 
> Hmm but KS still somehow enforced strong typing across indexing
> sessions?


Nope, it wasn't enforced.

> You said "of course" before but... how in your proposal could one
> store all stats for a given field during indexing, but then sometimes
> use match-only and sometimes full-scoring when querying against that
> field?

The same way that Lucene knows that sometimes it needs a docs-only-enum and
sometimes it needs a docs-and-positions enum.  Sometimes you need scores,
sometimes you don't.

> >> If user switches up their codec then they'll need to ensure it also
> >> stores stats required by their Sim(s).
> >
> > That's backwards, IMO.
> 
> I'm still baffled.  If I wanna play a movie on my 1080P monitor I'll
> need to find a movie that was encoded hidef (ie, bluray not dvd).
> 
> I mean, I don't have to.  DVD content will play fine still... just
> degraded quality.

Heh.  Consumers hate format wars....

In this case, though, we're dealing with software, not DVD hardware, so
upgrading is a lot easier.  Under the format-follows-Similarity model, the
relationship between Similarity and posting format is more akin to the
relationship between a container format like Quicktime and codecs like
Sorenson 3 or H.264.  

Tweakers will want to go in and monkey with the choice of codec within the
Quicktime file, but most users will just trust us to use the latest and
greatest.

> > The posting format encoding should be an implementation detail.  The general
> > user should be expressing their intent as far as how they want the field to be
> > scored, and the posting format should flow from that.
> 
> Maybe it's that it bothers you that with this proposed change the
> user makes 2 decisions -- Codec and Sim?

Yes, and it bothers me that users have to know about codecs at all, when in
the vast majority of cases it doesn't matter because the default is going to
be the best choice.

Since compression algorithm performance depends on knowing how to exploit
patterns in the data and sometimes the user will know about patterns that are
opaque to us, in some circumstances they will be able to select a more
appropriate codec.  But that's not the common case, as it requires both
unusual data and an unusually sophisticated user.

What users will be able to tell us is how they want the field to be used, and
we can use that information to help us optimize.  For example, when a user
declares that they want a field to be "match-only", we know we don't have to
write boost bytes, freq or positions, saving space.
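
To make that concrete, here is a minimal Java sketch of "declared usage decides
what gets written".  The names FieldUsage and PostingData are hypothetical,
made up for illustration; they are not Lucene or Lucy APIs, and the exact set
of data each usage needs is an assumption.

```java
import java.util.EnumSet;

// Hypothetical names throughout; none of these are Lucene or Lucy APIs.
enum PostingData { DOC_IDS, FREQ, BOOST_BYTES, POSITIONS }

// The user declares how a field will be used; the set of data that must be
// written to the postings follows from that declaration.
enum FieldUsage {
    MATCH_ONLY(EnumSet.of(PostingData.DOC_IDS)),
    SCORED(EnumSet.of(PostingData.DOC_IDS, PostingData.FREQ, PostingData.BOOST_BYTES)),
    SCORED_WITH_POSITIONS(EnumSet.allOf(PostingData.class));

    private final EnumSet<PostingData> required;

    FieldUsage(EnumSet<PostingData> required) {
        this.required = required;
    }

    // Returns a defensive copy so callers cannot alter the declaration.
    EnumSet<PostingData> required() {
        return EnumSet.copyOf(required);
    }
}

class FieldUsageDemo {
    public static void main(String[] args) {
        // A match-only field needs doc ids and nothing else.
        System.out.println(FieldUsage.MATCH_ONLY.required());
    }
}
```

The user never names an encoding here; they only state intent, and the writer
is free to skip everything the declaration does not require.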

> Ie user will choose PFor or Standard or Pulsing(PFor/Standard) codec, and
> then separately choose Sim?
> 
> But these are important choices.  They should be separate.  Why
> force-bundle them?

Because most of the time the user isn't going to be able to improve on the
default.

> > Whether we use VInt, PFOR, group varint, hand-tuned bit shifting, etc under
> > the hood to implement BM25, match-only, boost-per-position or whatever
> > shouldn't be the user's concern.  As time goes on, we should allow ourselves
> > the flexibility to use new compression techniques to write new segments.
> 
> But w/ the proposed change Lucene users will be free to use better
> codecs? 

They could use better codecs under the format-follows-Similarity model, too.
They'd just have to subclass and override the factory methods that spawn
posting encoders/decoders.
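
Roughly like this, as a Java sketch.  Every class and method name below is
hypothetical (this is not the real Lucene or Lucy class hierarchy), and the
encoders are stubs that only report their name.

```java
// Hypothetical sketch: these are not real Lucene or Lucy classes.
interface PostingEncoder {
    String name();
}

class VIntEncoder implements PostingEncoder {
    public String name() { return "vint"; }
}

class PForEncoder implements PostingEncoder {
    public String name() { return "pfor"; }
}

// The similarity owns the choice of encoding and exposes it via a factory method.
class MatchOnlySimilarity {
    // Default choice made by the library; most users never touch this.
    PostingEncoder makeEncoder() {
        return new VIntEncoder();
    }
}

// A tweaker who knows their data subclasses and overrides the factory method.
class TweakedMatchOnlySimilarity extends MatchOnlySimilarity {
    @Override
    PostingEncoder makeEncoder() {
        return new PForEncoder();
    }
}

class FactoryOverrideDemo {
    public static void main(String[] args) {
        System.out.println(new MatchOnlySimilarity().makeEncoder().name());
        System.out.println(new TweakedMatchOnlySimilarity().makeEncoder().name());
    }
}
```

The default path requires no codec decision at all; the override is there for
the unusual case.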

> Are you worried about proper defaulting?  We'll handle that
> (under Version).

I don't think it's necessary or desirable to handle this with Version.  A
codec improvement (say, encoding match-only fields using PFOR instead of
VInts) would simply trigger an index format number increment, and new segments
would be written using the latest format.
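
As a sketch of what I mean, in Java: the format numbers, the SegmentFormats
class and the string decoder names are all invented for illustration, and a
real implementation would return decoder objects rather than strings.

```java
// Hypothetical sketch; format numbers and names are invented for illustration.
final class SegmentFormats {
    static final int VINT_FORMAT = 1;   // older segments: match-only encoded with VInts
    static final int PFOR_FORMAT = 2;   // after the codec improvement: PFOR
    static final int LATEST = PFOR_FORMAT;

    private SegmentFormats() {}

    // New segments are always written with the latest format.
    static int formatForNewSegment() {
        return LATEST;
    }

    // Existing segments are read with whatever format number they recorded.
    static String decoderFor(int format) {
        switch (format) {
            case VINT_FORMAT: return "vint";
            case PFOR_FORMAT: return "pfor";
            default:
                throw new IllegalArgumentException("unknown segment format " + format);
        }
    }
}

class SegmentFormatDemo {
    public static void main(String[] args) {
        System.out.println(SegmentFormats.decoderFor(SegmentFormats.VINT_FORMAT));
        System.out.println(SegmentFormats.decoderFor(SegmentFormats.formatForNewSegment()));
    }
}
```

Old segments stay readable and get rewritten in the new format as they are
merged, with no user-visible setting involved.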

> > There's no difference between calling enum.nextPosition() and
> > positions.next(), is there?
> 
> Right now it's a 2 step process when you access via attr -- first you
> ask the enum to next(), then you ask each attr associated w/ that enum
> for their value.

OK, I think I see where the limitation arises.

In Lucy/KS, we'd just access the positions value as a member variable (direct
struct access) rather than invoking a method.  By default, struct definitions
are opaque and thus member vars are inaccessible (to encourage loose
coupling), but we override that in certain cases for performance.

However, direct struct access requires a direct inheritance guarantee, while
"attributes" in Lucene only guarantee interface compliance.  You don't want to
use the stronger, more constrictive check, right?
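
The distinction, transposed into Java as a rough sketch (Lucy/KS does this
with C structs; the names below are hypothetical and are not Lucene's
attribute API):

```java
// Hypothetical names; not real Lucene or Lucy classes.
interface PositionAttr {
    // Interface compliance only: the value is reachable through a method call.
    int position();
}

class ConcretePosting implements PositionAttr {
    // Exposed member, readable directly when the concrete type is known.
    int position;

    @Override
    public int position() {
        return position;
    }
}

class DirectAccessDemo {
    // Works for any implementation of the interface.
    static int viaInterface(PositionAttr attr) {
        return attr.position();
    }

    // Needs the stronger guarantee: the argument must be ConcretePosting
    // or a subclass of it.
    static int viaField(ConcretePosting posting) {
        return posting.position;
    }

    public static void main(String[] args) {
        ConcretePosting p = new ConcretePosting();
        p.position = 7;
        System.out.println(viaInterface(p) + " " + viaField(p));
    }
}
```

viaField() cannot accept an arbitrary PositionAttr, which is exactly the
constraint in question.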

Marvin Humphrey

