On Mon, Mar 08, 2010 at 01:13:53PM -0500, Michael McCandless wrote:

> I think we can actually do so w/o losing Lucene's loose typing if we
> simply peeled out [say] a FieldType class that holds the settings you
> now set on each field (omitTFAP, omitNorms, TermVector, Store,
> Index), and Field instance holds a ref to its FieldType. We could
> then store Analyzer and Codec on there, too.

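For concreteness, the FieldType idea quoted above might look roughly like the sketch below. It is only an illustration under stated assumptions: the class shapes, constructor, and flag names are hypothetical, chosen to mirror the settings named in the quote, and are not taken from any existing Lucene or Lucy API. Analyzer and Codec are left out to keep it short.

```java
import java.util.Objects;

/**
 * Hypothetical sketch: per-field settings peeled out of Field.
 * Immutable, so one instance can be shared by many Field objects.
 */
final class FieldType {
    final boolean stored;       // Store
    final boolean indexed;      // Index
    final boolean omitNorms;    // omitNorms
    final boolean omitTFAP;     // omit term freq and positions
    final boolean termVectors;  // TermVector

    FieldType(boolean stored, boolean indexed, boolean omitNorms,
              boolean omitTFAP, boolean termVectors) {
        this.stored = stored;
        this.indexed = indexed;
        this.omitNorms = omitNorms;
        this.omitTFAP = omitTFAP;
        this.termVectors = termVectors;
    }
}

/** Hypothetical Field: holds a name, a value, and a reference to its type. */
final class Field {
    final String name;
    final String value;
    final FieldType type;

    Field(String name, String value, FieldType type) {
        this.name = Objects.requireNonNull(name);
        this.value = Objects.requireNonNull(value);
        this.type = Objects.requireNonNull(type);
    }
}
```

Nothing here ties a field name to one FieldType, so typing stays loose: two fields named "title" may share an instance or use different ones, and the application's source code is the only place the types live.
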
You can use shared FieldType instances to hold typing information without
enforcing consistency.

> Lucene would still be "loosely typed" (ie, no global schema) in that
> every time you index new docs you're free to make up a new FieldType
> instance (ie it wouldn't be stored in the index -- it's "stored" in
> your app's java sources), though probably FieldType itself would be
> write once during an IndexWriter session.

For what it's worth, that's sort of the way KS used to work: Schema/FieldType
information was stored entirely in source code. That's changed and now we
serialize the whole schema including all Analyzers, but source-code-only is a
viable approach.

> Hmm big change though -- I don't want to gate landing flex with this.

Perhaps factoring out FieldType from Field can be done on trunk, now? From a
distance, it looks to be a straightforward subtractive refactoring.

> > I see what you're getting at. However, Similarity *already* affects the
> > contents of the index, via encodeNorm()/decodeNorm() and lengthNorm(). So
> > if you want to divorce Similarity from index format, you'll need to remove
> > those methods.
>
> This brings us full circle -- it's exactly what I'd like to do as the
> baby step ;)
>
> Ie, lengthNorm would no longer be publicly used (since, instead, the
> true stats are written to the index). (Privately, within Sim impls
> it'd presumably still be used).
>
> encode/decodeNorm would also be private to the Sim impl -- that's just
> a way to quantize a float into a single byte, to save RAM. Other Sim
> impls may just want to store a float directly, use 2 bytes to quantize
> floats, use only 4 bits per norm, don't store anything (match only),
> etc.

OK, I see.

Note that although it would mean writing redundant data, Lucy could
theoretically record the same raw stats. It's just that Lucene would generate
the derived data structures at search-time, while Lucy would generate them at
index-time and then mmap the files at search-time.
I don't think we'd do that, though -- we'd just accept the lossiness and write
out the derived data -- but preparing per-docXfield boost/norm info involves
approximately the same amount of work no matter how you time-shift it.

> I do agree there's some connection -- if I don't store tf nor
> positions then I can't use a Sim that needs these stats.
>
> > I also like the idea of novice/intermediate users being able to express the
> > intent for how a field gets scored by choosing a Similarity subclass,
> > without having to worry about the underlying details of posting format.
>
> Well.. I think standard codec in Lucene will store these 2 common
> stats (field length, avg(tf)), then provide various Sim impls? So w/
> default codec user can still pick the Sim impl that does the scoring
> they want?

OK, that's actually handy, because it allows people to tweak length
normalization without reindexing and presumably speeds up development. Of all
the knobs that Similarity gives us, lengthNorm() is far and away the most
important.

I guess you're OK with slowing down standard Lucene index opens to achieve
this flexibility, since you're going to burn CPU deriving those boost/norm
stats. Subtle way of encouraging people to use the NRT API, eh?

I can see why you're so resistant to the idea of tying Similarity to format,
now. However, I think you've managed to persuade me that it's exactly the
right thing to do from an API standpoint. :)

Probably our perspectives and priorities diverge because of the fact that in
Lucene, Similarity is index-wide, while in Lucy/KS, it's per-field. E.g., from
your perspective, match-only indexes would be pretty esoteric, but from my
perspective, match-only fields make perfect sense.

> If user switches up their codec then they'll need to ensure it also
> stores stats required by their Sim(s).

That's backwards, IMO. The posting format encoding should be an
implementation detail.
The general user should be expressing their intent as far as how they want the
field to be scored, and the posting format should flow from that. Whether we
use VInt, PFOR, group varint, hand-tuned bit shifting, etc under the hood to
implement BM25, match-only, boost-per-position or whatever shouldn't be the
user's concern. As time goes on, we should allow ourselves the flexibility to
use new compression techniques to write new segments.

> > Just a thought: why not make positions an attribute on a DocsEnum?
>
> Maybe... though I think the double method call (enum.next() then
> posAttr.get()) is too much added cost.

Why wouldn't it work to have the consumer extract the positions attribute from
the DocsEnum during construction? There's no difference between calling
enum.nextPosition() and positions.next(), is there?

Marvin Humphrey
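
PS: To make the positions question concrete, here is a minimal sketch of a consumer that fetches the positions attribute once, at construction, and then makes a single call per position. Every name in it (this DocsEnum interface, PositionsAttribute, ArrayDocsEnum, PositionCollector) is hypothetical and invented for illustration; it is not Lucene's actual flex API, and the in-memory enum is only a stand-in for a real postings reader.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical attribute exposing positions for the current document. */
interface PositionsAttribute {
    /** Returns the next position in the current doc, or -1 when exhausted. */
    int next();
}

/** Hypothetical docs enum; an assumption, not Lucene's DocsEnum. */
interface DocsEnum {
    /** Advances to the next document; returns false when there are no more. */
    boolean next();

    /** Returns the positions attribute, valid for the life of the enum. */
    PositionsAttribute positions();
}

/** Toy in-memory enum: positions[i] holds the positions for document i. */
final class ArrayDocsEnum implements DocsEnum {
    private final int[][] positions;
    private int doc = -1;
    private int pos = 0;
    private final PositionsAttribute attr = new Positions();

    /** Reads positions of whatever document the enclosing enum is on. */
    private final class Positions implements PositionsAttribute {
        @Override
        public int next() {
            int[] current = positions[doc];
            return pos < current.length ? current[pos++] : -1;
        }
    }

    ArrayDocsEnum(int[][] positions) {
        this.positions = positions;
    }

    @Override
    public boolean next() {
        doc++;
        pos = 0; // restart position iteration for the new document
        return doc < positions.length;
    }

    @Override
    public PositionsAttribute positions() {
        return attr;
    }
}

/** Consumer that looks up the attribute once, during construction. */
final class PositionCollector {
    private final DocsEnum docs;
    private final PositionsAttribute positions;

    PositionCollector(DocsEnum docs) {
        this.docs = docs;
        this.positions = docs.positions(); // single lookup, reused for every doc
    }

    /** Collects every position of every document, in order. */
    List<Integer> collectAll() {
        List<Integer> out = new ArrayList<>();
        while (docs.next()) {
            for (int p = positions.next(); p != -1; p = positions.next()) {
                out.add(p); // one method call per position
            }
        }
        return out;
    }
}
```

In this sketch the inner loop costs one call per position, the same as a hypothetical enum.nextPosition() would; the attribute lookup is paid once per consumer rather than once per document.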