On Mon, Mar 15, 2010 at 05:28:33AM -0500, Michael McCandless wrote:

> I mean specifically one should not have to commit to the precise
> scoring model they will use for a given field, when they index that
> field.

Yeah, I've never seen committing to a precise scoring model at index-time
via Sim choice as a big deal.  In Lucy, per-field Similarity assignments
are part of the Schema, which has to be set at index-time.  And index-time
Sim choice is the way things have always been done in Lucene.

In any case, the proposal to start delaying Sim choice to search-time --
while a nice feature for Lucene -- is a non-starter for Lucy.  We can't do
that because it would kill the cheap-Searcher model to generate boost bytes
at Searcher construction time and cache them within the object.  We need
those boost bytes written to disk so we can mmap them and share them
amongst many cheap Searchers.

So... you're proposing shrinking Similarity's public API by removing
functionality that Lucy can't live without.  If indeed that works out for
Lucene, the role of Similarity within the two libraries will have to
diverge.  In Lucene, Similarity will get smaller; in Lucy it will expand a
bit.

To my mind, these are all related data reduction tasks:

  * Omit doc-boost and field-boost, replacing them with a single float
    docXfield multiplier -- because you never need doc-boost on its own.
  * Omit length-in-tokens, term-cardinality, doc-boost, and field-boost,
    replacing them all with a single boost byte -- because for the kind of
    scoring you want to do, you don't need all those raw stats.
  * Omit the boost byte, because you don't need to do scoring at all.
  * Omit positions because you don't need PhraseQueries, etc. to match.
  * Omit everything except doc-id, because you only need binary matching.

What all those tasks have in common is that we can determine what stats
are disposable based on how the user describes how they are going to use
the field.

For Lucy, the user is going to have to commit to a "precise scoring model"
at index-time by specifying a Sim choice anyway.  If that Sim turns out to
be a MatchSimilarity, why on earth should we keep around the boost bytes?
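
As a rough illustration of the last three items in the list above, here is a
minimal Java sketch of deriving the stored stats from declared field usage.
Everything in it is an assumption made for illustration: FormatFollowsSim,
Stat, Usage, and statsFor are hypothetical names, not part of the Lucy or
Lucene APIs, and the mapping is simplified to three kinds of usage.

```java
import java.util.EnumSet;

/** Hypothetical sketch: the declared usage of a field decides which stats are kept. */
class FormatFollowsSim {
    /** Per-posting data an index might store (illustrative names only). */
    enum Stat { DOC_ID, POSITIONS, BOOST_BYTE }

    /** How the user says the field will be used (illustrative names only). */
    enum Usage { SCORED, MATCH_WITH_PHRASES, MATCH_ONLY }

    /** Returns the stats that cannot be discarded for the given usage. */
    static EnumSet<Stat> statsFor(Usage usage) {
        switch (usage) {
            case SCORED:
                // Assumption: scoring plus phrase matching keeps everything.
                return EnumSet.allOf(Stat.class);
            case MATCH_WITH_PHRASES:
                // No scoring, so the boost byte is disposable.
                return EnumSet.of(Stat.DOC_ID, Stat.POSITIONS);
            case MATCH_ONLY:
                // Binary matching needs doc ids and nothing else.
                return EnumSet.of(Stat.DOC_ID);
            default:
                throw new IllegalArgumentException("unknown usage: " + usage);
        }
    }

    public static void main(String[] args) {
        for (Usage u : Usage.values()) {
            System.out.println(u + " -> " + statsFor(u));
        }
    }
}
```

The point of the sketch is only that the user states intent once and the set
of stats falls out of it; a real implementation would live inside the
Similarity subclass rather than in a standalone switch.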

> > And what class other than Similarity knows enough about the scoring
> > algorithm to perform these data reduction tasks?  If it's not going
> > to be Similarity itself, it has to be something that knows absolutely
> > everything about the Similarity implementation's scoring model.
>
> I don't follow this...
>
> It will be Sim that computes norm bytes.

I meant that if you're writing out boost bytes, there's no sensible way to
execute the lossy data reduction and reduce the index size other than
having Sim do it.

> >   class MySim extends Similarity {
> >     public PostingCodec makePostingCodec() {
> >       StandardPostingCodec codec = new StandardPostingCodec();
> >       codec.setOmitBoostBytes(true);
> >       codec.setOmitPositions(true);
> >       return (PostingCodec)codec;
> >     }
> >   }
>
> This still feels like you are mixing two very different concepts --
> what's being written (boost bytes, positions, docTermFreqs) vs how it's
> encoded (codec).

So StandardPostingCodec shouldn't have methods like setOmitBoostBytes()?
Maybe that's right.  Guess I'll watch to see how flex pans out and what
methods you put on those PostingCodec classes.

For now, I just want to make the no-boost-bytes and doc-id-only index
optimizations available, and to achieve that, it's sufficient to implement
format-follows-sim and publish MatchSimilarity and MinimalSimilarity.  The
PostingCodec API can remain a private implementation detail until a later
date.

> Shouldn't Lucy's schema record what stats should be indexed for the field?

No, it shouldn't -- not directly.  You tell the Schema how you want the
field to be used.  That information is used to derive what stats are
needed, and whether the ones that are needed can be combined, compressed,
etc.

> Then, any codec you swap in should respect that?  EG maybe I use PForCodec
> instead, or a PulsingCode(PForCodec)?

I guess.  I don't see publishing a PForCodec with an elaborate API as being
very important, though.
It's more important to just use PFOR internally when it's the best choice.

> I'm thinking the various Sim classes, which you'd select during
> searching, will note in jdocs what attrs must be indexed.  It's your
> job to read that and set your field (schema) up accordingly, ie,
> enable those required attrs.

Yeah, that'll at least get the job done for Lucene.  I don't think it's
ideal to force people to understand that stuff, but hey, the more people
are confused, the more important it is for them to buy optimization
seminars where Lucene gurus explain all the obscure incantations to them.
:)

> > You seem to be fixated on the notion of swapping in a MatchOnlySim
> > object at search time.  You can't do that in KS/Lucy, because you
> > can't modify a Schema at search-time, and the per-field Similarity
> > assignments are part of the Schema.  But *it doesn't matter* because
> > you don't need a MatchOnlySim to do doc-id-only postings iteration --
> > an AllBellsAndWhistlesScoringSim can spawn a doc-id-only
> > PostingDecoder just as easily as MatchOnlySim can.
>
> I am fixated because it's a glaring example (to me) of what's wrong
> with forcing user to commit to how scoring is going to happen, at
> index time, for that field.

Haha, well that would sure suck if it didn't work!  But I'm telling you
it's no problem.

> And I'm still confused on how this'll work in Lucy -- if in my global
> write-once Lucy scheme I bind a field during indexing to
> AllBellsAndWhistlesScoringSim... then at search time, sure, it can
> spawn a doc-id-only PostingDecoder... so that does mean I can do
> match-only searching using that, somehow?

Of course.  Lucene can't do that?  No way, that can't be right!  I've gotta
be missing something.  (Though I guess that would explain the fixation on
needing a different Sim.)

Needing a special Sim for match-only seems like an absurd limitation -- I
mean the doc id data is there, and you don't need scores.  You've gotta be
able to fake it at least.
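
To show why match-only iteration over a fully scored field is unproblematic,
here is a minimal Java sketch.  All names in it (DocIdOnlyDecoding, Posting,
Sim, scoreAll, matchOnly) are hypothetical and the in-memory posting layout is
an assumption; it is not how Lucy or Lucene store or decode postings.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: match-only reading of postings that carry scoring stats. */
class DocIdOnlyDecoding {
    /** Assumed posting layout for a field indexed with full scoring stats. */
    static final class Posting {
        final int docId;
        final int termFreq;
        final byte boostByte;

        Posting(int docId, int termFreq, byte boostByte) {
            this.docId = docId;
            this.termFreq = termFreq;
            this.boostByte = boostByte;
        }
    }

    /** Stand-in for a Similarity; only the scoring path ever calls it. */
    interface Sim {
        float score(int termFreq, byte boostByte);
    }

    /** Scoring path: consults the Sim once per posting. */
    static List<Float> scoreAll(List<Posting> postings, Sim sim) {
        List<Float> scores = new ArrayList<>();
        for (Posting p : postings) {
            scores.add(sim.score(p.termFreq, p.boostByte));
        }
        return scores;
    }

    /** Match-only path: reads doc ids, skips the other stats, needs no Sim. */
    static List<Integer> matchOnly(List<Posting> postings) {
        List<Integer> docIds = new ArrayList<>();
        for (Posting p : postings) {
            docIds.add(p.docId);
        }
        return docIds;
    }
}
```

In this toy version the same postings serve both paths, and matchOnly never
touches a Sim, which is the sense in which no special match-only Sim is
needed at search time.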

> (Ie I can't change the field to MatchOnlySim, but, I have some
> workaround that lets me achieve the same functionality...?).

It's not a workaround.  Things just work that way.

Without getting into the gory details... if you're not calculating a score,
you don't need Similarity's functionality.  If Lucene still needs a Sim
object despite not needing its functionality, that's just an accident of
the OO design, and it so happens that our "loose C" port doesn't have the
same quirk.

Marvin Humphrey