On Sat, Mar 13, 2010 at 06:41:26AM -0500, Michael McCandless wrote:

> I still don't think similarity should have any bearing during indexing.

Similarity has always, from day one, affected the contents of the index. This idea that it should be totally divorced from indexing is, in fact, a very significant change that you are proposing for Lucene, and it will require non-trivial changes to the file format.

For starters, you're going to at least double the footprint of the norms. For fields with more than 127 tokens or 127 unique terms, the increase will be greater... and if the user sets doc-boost and field-boost in a pattern that defies RLE compression, the footprint will be greater still.

I happen to think that limited search-time settability of Similarity offers a nice feature -- the ability to futz with different weighting models and length normalization settings without reindexing -- and that it's worth exploring in pursuit of this feature. But by opting to forego the lossy compression now performed by encodeNorm() at index-time and store precursor statistics instead, we are going to take a hit on index size even with lossless compression.

Furthermore, delaying Similarity choice means that it becomes the user's responsibility to ensure that index-time Codec choice is compatible with search-time Similarity choice. In contrast, setting Similarity at index-time means that the core gets to pick the Codec and can ensure that all the necessary data gets encoded, sparing the user from having to understand the gory details of posting formats.

In summary, I think search-time setting of Similarity is a nice feature but a poor requirement. I'm not persuaded that this proposal to banish Similarity from index-time is wise.

> But I don't like baking in search concepts at index time...

Then you ought to use a traditional RDBMS rather than an indexing engine, and make sure you don't put indexes on any of the fields in your tables. :) Or maybe an RDBMS has too many search concepts baked in, and a flat file would be best. :)

Seriously...
optimizing on-disk data structures to accommodate anticipated search query patterns and maximize speed and relevance... that's what indexing's all about, ain't it?

And what class other than Similarity knows enough about the scoring algorithm to perform these data reduction tasks? If it's not going to be Similarity itself, it has to be something that knows absolutely everything about the Similarity implementation's scoring model.

> > Right. However, now that I've thought about it, if a user indicates that a
> > field is "match-only" by supplying a MatchSimilarity, we know that we can
> > omit boost bytes.
> >
> > So we can re-conceive "MatchSimilarity" as being analogous to omitNorms.
> > Huzzah!
> >
> > One down, one to go. :)
>
> Hmm except shouldn't you allow omitting boost bytes but keeping term
> freqs? Ie all docs are roughly the same length (say, a title field)
> and I never boost them? How will you allow this?

I think that you've described an uncommon use case, and it's tempting to just wave it off with the easy answer: you spec a Sim that writes such a format.

But here's where maybe Lucy can steal from the Lucene flex branch. We can give Similarity a makePostingCodec() factory method. Then, we can publish common PostingCodecs as public classes, allowing us to support different formats with minimal effort.

    class MySim extends Similarity {
        public PostingCodec makePostingCodec() {
            // Drop boost bytes and positions for fields using this Sim.
            StandardPostingCodec codec = new StandardPostingCodec();
            codec.setOmitBoostBytes(true);
            codec.setOmitPositions(true);
            return codec;
        }
    }

(FWIW, you could theoretically do something similar with Lucene: supply one Sim at index time, but write precursors instead of boost bytes and allow a different Sim to be used at search-time.)

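To make the factory idea a bit more concrete, here is a rough, self-contained sketch of how an indexer might consume makePostingCodec(). Everything in it is an assumption made for illustration: the PostingData enum, storedData(), setOmitTermFreqs(), TitleSim, DefaultSim and the dataWrittenFor() helper are hypothetical names, and none of this is actual Lucy or Lucene API. It also assumes that a match-only field needs nothing beyond doc ids.

```java
import java.util.EnumSet;
import java.util.Set;

// All names here are hypothetical stand-ins, not actual Lucy or Lucene classes.
enum PostingData { DOC_IDS, TERM_FREQS, POSITIONS, BOOST_BYTES }

interface PostingCodec {
    // The per-posting data this codec writes to the index.
    Set<PostingData> storedData();
}

class StandardPostingCodec implements PostingCodec {
    // Starts out storing everything; setters strip what a field does not need.
    private final EnumSet<PostingData> stored = EnumSet.allOf(PostingData.class);

    void setOmitBoostBytes(boolean omit) { toggle(PostingData.BOOST_BYTES, omit); }
    void setOmitPositions(boolean omit)  { toggle(PostingData.POSITIONS, omit); }
    void setOmitTermFreqs(boolean omit)  { toggle(PostingData.TERM_FREQS, omit); }

    private void toggle(PostingData item, boolean omit) {
        if (omit) stored.remove(item); else stored.add(item);
    }

    @Override
    public Set<PostingData> storedData() {
        return EnumSet.copyOf(stored);
    }
}

abstract class Similarity {
    // Default: a fully scored field keeps everything.
    PostingCodec makePostingCodec() {
        return new StandardPostingCodec();
    }
}

class DefaultSim extends Similarity {}

// Match-only field: assumed here to need doc ids and nothing else.
class MatchSimilarity extends Similarity {
    @Override
    PostingCodec makePostingCodec() {
        StandardPostingCodec codec = new StandardPostingCodec();
        codec.setOmitBoostBytes(true);
        codec.setOmitPositions(true);
        codec.setOmitTermFreqs(true);
        return codec;
    }
}

// The title-field case from the quoted question: no boosts, term freqs kept.
class TitleSim extends Similarity {
    @Override
    PostingCodec makePostingCodec() {
        StandardPostingCodec codec = new StandardPostingCodec();
        codec.setOmitBoostBytes(true);
        return codec;
    }
}

final class PostingCodecDemo {
    // Stand-in for the indexer: it asks the field's Similarity what to write.
    static Set<PostingData> dataWrittenFor(Similarity sim) {
        return sim.makePostingCodec().storedData();
    }

    public static void main(String[] args) {
        System.out.println("default:    " + dataWrittenFor(new DefaultSim()));
        System.out.println("match-only: " + dataWrittenFor(new MatchSimilarity()));
        System.out.println("title:      " + dataWrittenFor(new TitleSim()));
    }
}
```

The point of the sketch is only the direction of control: the user picks a Similarity per field, and the indexer derives the posting format from it rather than asking the user to name the format directly.
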
This setup follows the easy-things-easy-hard-things-possible model, because the user doesn't have to know posting formats intimately to start optimizing away needless data, but experts like Earwin get the direct access they seemingly can't live without.

> I agree it's not great to have to speak/think in low level indexing
> attr concepts... because it forces user to translate to what that
> means at search time. But I still don't see a great alternative. I
> don't like pushing the Sim choice all the way back into indexing.

You make it sound like the way things have been done since forever is some radical experiment. :P

Similarity choice IS made at index time. Nobody's pushing it back -- you're proposing that it be pushed forward.

> > Under Lucy, you can't switch to a different weighting model at search time
> > because the boost bytes are baked into the index. But you can still do
> > doc-id-only posting iteration against any posting format since doc-id-only
> > is the minimum requirement for a posting list.
> >
> > So your question is predicated on the assumption that you need a
> > doc-id-only Similarity to do doc-id-only postings iteration, but that's not
> > true -- you need a doc-id-only PostingDecoder, which may be spawned by any
> > Similarity.
> >
> > Does that make sense?
>
> It sounds like... if the user had used AllBellsAndWhistlesScoringSim
> while indexing, they will still be able to use MatchOnlySim while
> searching because under-the-hood MatchOnlySim knows how to pull a
> docID only postings iterator from that field.

You seem to be fixated on the notion of swapping in a MatchOnlySim object at search time. You can't do that in KS/Lucy, because you can't modify a Schema at search-time, and the per-field Similarity assignments are part of the Schema.
But *it doesn't matter* because you don't need a MatchOnlySim to do doc-id-only postings iteration -- an AllBellsAndWhistlesScoringSim can spawn a doc-id-only PostingDecoder just as easily as MatchOnlySim can.

Marvin Humphrey
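
PS: a rough sketch of the doc-id-only point, in case code makes it clearer. The class layout is an assumption made for illustration, not the real KS/Lucy design: Posting, DocIdOnlyDecoder, makeDocIdOnlyDecoder() and matchingDocs() are hypothetical names, and an in-memory list stands in for an on-disk posting list. The only thing it is meant to show is that the doc-id-only decoder hangs off the base class, so any Similarity can hand one out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration; not actual KS/Lucy classes.
final class Posting {
    final int docId;
    final int termFreq;
    final byte boostByte;

    Posting(int docId, int termFreq, byte boostByte) {
        this.docId = docId;
        this.termFreq = termFreq;
        this.boostByte = boostByte;
    }
}

// Reads only doc ids and ignores whatever else the posting format stores.
final class DocIdOnlyDecoder {
    private final List<Posting> postings;
    private int pos = -1;

    DocIdOnlyDecoder(List<Posting> postings) {
        this.postings = postings;
    }

    // Advances to the next posting; returns false once the list is exhausted.
    boolean next() {
        return ++pos < postings.size();
    }

    int docId() {
        return postings.get(pos).docId;
    }
}

abstract class Similarity {
    // Lives on the base class: every posting format carries doc ids,
    // so every Similarity can spawn this decoder.
    DocIdOnlyDecoder makeDocIdOnlyDecoder(List<Posting> postings) {
        return new DocIdOnlyDecoder(postings);
    }
}

class AllBellsAndWhistlesScoringSim extends Similarity {}

class MatchOnlySim extends Similarity {}

final class DocIdOnlyDemo {
    // Collects matching doc ids without touching freqs or boost bytes.
    static List<Integer> matchingDocs(Similarity sim, List<Posting> postings) {
        List<Integer> ids = new ArrayList<>();
        DocIdOnlyDecoder decoder = sim.makeDocIdOnlyDecoder(postings);
        while (decoder.next()) {
            ids.add(decoder.docId());
        }
        return ids;
    }

    public static void main(String[] args) {
        // A field indexed with full scoring data.
        List<Posting> postings = Arrays.asList(
            new Posting(3, 2, (byte) 124),
            new Posting(7, 1, (byte) 118),
            new Posting(42, 5, (byte) 110));
        System.out.println(matchingDocs(new AllBellsAndWhistlesScoringSim(), postings));
    }
}
```

Under these assumptions the Similarity assigned in the Schema never has to change: match-only iteration is a question of which decoder gets requested, not which Sim is installed.
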