On Thu, Mar 11, 2010 at 05:59:03AM -0500, Michael McCandless wrote:

> > So there would be polymorphism in the decoding phase while we're supplying
> > information the Similarity object needs to make its similarity judgments.
> > However, that polymorphism would be handled internally -- it wouldn't be the
> > responsibility of the user to determine whether a codec supported a
> > particular scoring model.
>
> Is that "yes" (a user can do MatchOnlySim at search time" if the field
> were indexed with B25Sim)?

In essence, yes. Technically, no.

Under the covers, doc-id-only postings iteration probably wouldn't be
implemented by spawning a doc-id-only Similarity object. It would probably be
something more like, ask the Similarity for a PostingDecoder with no extra
attributes. And then docID-freq-boost postings iteration might be achieved by
asking the Similarity for a PostingDecoder with TermFreq and DocBoost
attributes.

> How will Lucy "know" which switchups (Sim at indexing vs Sim at
> searching) are "OK"...

I think the theme is that each Similarity class will have a whitelist of
supported posting iteration configurations. So long as the requested config is
in the whitelist, you get an iterator back -- otherwise, you get NULL.

Exactly what form the request specification would take, that's up in the air.
But it would be an implementation detail for now. So long as the file format
supports the data, we can build an iterator that reads it, regardless of
encoding.

> >> Yeah.... so, I don't like that in Lucene you call "Field.setOmitTFAP"
> >> instead of saying "Field.matchOnly" (or something). So I do agree
> >> that it'd be better if the API made it clear what the *search* time
> >> impact is of using this advanced Field API.
> >
> > In my opinion, it makes sense to communicate "match only" by way of the
> > Similarity object as opposed to a boolean. I think it's a good way to
> > introduce the Similarity class and get people comfortable with it, and I
> > also think that it's good to keep stuff out of the FieldType API when we can.
>
> But say we want to also allow storing tf but not positions, because
> really the two choices should not be coupled (as they are today with
> Lucene's omitTFAP).
>
> So I have omitTF and omitP (only 3 combos are allowed -- must omitP if
> you omitTF).
>
> What Sim do you call that at indexing time?

Well, those are pretty esoteric posting formats.
It's common to not need scores and therefore not need boost bytes (the Lucene
omitNorms case). It's also common to not need any matching info beyond doc id
(the Lucene omitTFAP case). But omitTF and omitP aren't common needs, or
Lucene would have them by now, right?

And since they are infrequently used, Huffman-driven naming philosophy
suggests that they should have long, low-value names:
OmitPositionsSimilarity, OmitTFandPositionsSimilarity (or OmitTFAPSimilarity,
which would actually be an accurate abbreviation in this scenario as opposed
to the current Lucene omitTFAP).

In other words, I don't much care what those are named because they aren't
likely to be used except by people who A) have very, very specific use cases
and B) really know what they're doing.

In contrast, I think it's important that we come up with good names for the
doc-id-tf-positions-but-no-boost-bytes (aka omitNorms) and doc-id-only cases.

> >> We get users who are baffled that their phrase queries no longer work
> >> after setting omitTFAP.
> >
> > This is still a weakness of MatchSimilarity.
>
> Well MatchSimilarity arguably should mean "match all queries
> correctly, just don't score them". Ie, positional queries should in
> fact work... just not receive a score.

Right. However, now that I've thought about it, if a user indicates that a
field is "match-only" by supplying a MatchSimilarity, we know that we can omit
boost bytes. So we can re-conceive "MatchSimilarity" as being analogous to
omitNorms. Huzzah! One down, one to go. :)

> > On the other hand, typical candidates for MatchSimilarity...
> >
> >   * unique_id
> >   * category
> >   * tags
> >
> > ... either won't contain multiple tokens, or won't generally return sensible
> > results for phrase queries.
>
> Maybe we need to splinter MatchSim into the two cases. Whether
> positions are stored, and whether scoring is done, is really
> orthogonal.

Maybe "MinimalSimilarity" as the analogue for Lucene omitTFAP?
I dunno, that might be kind of generic, but maybe it makes sense in context.

The idea is to get the user to describe how the field will be scored. Based
on that info, we can customize the posting format, possibly making
optimizations and omitting certain posting data.

When people ask on the user list...

    "How can I make my index smaller?"

... we can reply like so:

    "Make some fields match-only by specifying MatchSimilarity in the
    FieldType, or even better if you don't need phrase queries, by specifying
    MinimalSimilarity. You'll be throwing away data Lucy needs for
    sophisticated queries, but your index will get smaller."

I think that response is easier to understand than a response instructing them
to "enable omitNorms", and it introduces the very important Similarity class
rather than the confusing, overloaded, and not-very-useful terminology,
"norms".

> >> > They could use better codecs under the format-follows-Similarity model,
> >> > too. They'd just have to subclass and override the factory methods that
> >> > spawn posting encoders/decoders.
> >>
> >> Ahh, OK so that's how they'd do it.
> >>
> >> So... I think we're making a mountain out of a molehill.
> >
> > Well, I don't see it that way, because I place great value on designing
> > good public APIs, and I think it's important that we avoid forcing users to
> > know about codecs.
>
> I had thought we were bickering about whether you subclass & override
> a method (to alter the codec) (= Lucy) vs you create your own
> Codec/CodecProvider and pass that to your writer, which seems..... a
> minor difference.
>
> If the user is not tweaking the codec, they don't have to do anything
> with codes (the defaults work) for either Lucy or Lucene.
>
> So the only difference is the specifics of how the codec-tweaking-user
> in fact alters the codec.

I don't think that's the only difference.

What does the novice user know about "PFOR", about "pulsing", about "group
varint", etc? They aren't going to know jack.
So how are you expecting them to distinguish between various Codec subclasses
named after those high-falutin' concepts?

The difference is that you're forcing the novice user to learn esoteric
material just to get started, while the format-follows-sim model is trying to
spare the novice yet enable the expert. Users shouldn't have to distinguish
between "codecs" until they are actually ready to write their own.

As we discussed on IRC yesterday, the number of people who will be qualified
to write posting codec code will still be very small, even after we finish
this democratization push. It will be a big step forward if we can just get
more Lucene committers to grok the inner workings of posting lists.

However, there are some very useful optimizations that will be underutilized
by the user base if the public API uses jargon like "omitTFAP" and "PFORCodec"
that shuts out everyone except elite developers.

> > Under format-follows-Sim, it would be the Similarity object that knows all
> > supported decoding configurations for the field.
>
> I'm still hazy on how you'll know at search time which Sims are
> "congruent" with what's stored in the index.... ie that downgrading to
> MatchOnlySim is allowed, but swapping to a different scoring model is
> not (because norms are committed at indexing time).

I'm not sure that e.g. TermScorer would even know what Similarity it was
dealing with. It would ask for a boost-byte decoder from the sim, but it
wouldn't have to know or care how the boost bytes got translated to float
multipliers.

Under Lucy, you can't switch to a different weighting model at search time
because the boost bytes are baked into the index. But you can still do
doc-id-only posting iteration against any posting format since doc-id-only is
the minimum requirement for a posting list.
So your question is predicated on the assumption that you need a doc-id-only
Similarity to do doc-id-only postings iteration, but that's not true -- you
need a doc-id-only PostingDecoder, which may be spawned by any Similarity.

Does that make sense?

Marvin Humphrey
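
To make the whitelist idea a bit more concrete, here is a rough Python sketch
of a Similarity handing out PostingDecoders only for the configurations it
supports. This is purely illustrative and not Lucy or Lucene code: the method
name make_posting_decoder, the "Positions" attribute, the ScoringSimilarity
class, and the exact contents of each whitelist are all assumptions made up
for the example. Only the names Similarity, PostingDecoder, TermFreq,
DocBoost, MatchSimilarity and MinimalSimilarity come from the discussion
above, and the request format is, as noted, still up in the air.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# "TermFreq" and "DocBoost" are named in the message; "Positions" is assumed.
TERM_FREQ = "TermFreq"
DOC_BOOST = "DocBoost"
POSITIONS = "Positions"


@dataclass(frozen=True)
class PostingDecoder:
    """Stand-in for an iterator that decodes only the requested attributes."""
    attributes: FrozenSet[str]


class Similarity:
    # Whitelist of supported posting iteration configurations. The empty
    # set is the doc-id-only config, which any posting list can satisfy.
    supported = frozenset({frozenset()})

    def make_posting_decoder(self, *attributes: str) -> Optional[PostingDecoder]:
        """Hypothetical factory: a decoder if the config is whitelisted."""
        config = frozenset(attributes)
        if config in self.supported:
            return PostingDecoder(config)
        return None  # the "otherwise, you get NULL" case


class MinimalSimilarity(Similarity):
    """Doc ids only (the analogue of Lucene's omitTFAP)."""


class MatchSimilarity(Similarity):
    """No boost bytes (the analogue of omitNorms); tf and positions kept."""
    supported = frozenset({
        frozenset(),
        frozenset({TERM_FREQ}),
        frozenset({TERM_FREQ, POSITIONS}),
    })


class ScoringSimilarity(Similarity):
    """Hypothetical full-scoring Similarity with boost bytes in the index."""
    supported = MatchSimilarity.supported | frozenset({
        frozenset({TERM_FREQ, DOC_BOOST}),
        frozenset({TERM_FREQ, DOC_BOOST, POSITIONS}),
    })


if __name__ == "__main__":
    for sim in (MinimalSimilarity(), MatchSimilarity(), ScoringSimilarity()):
        doc_id_only = sim.make_posting_decoder()
        scoring = sim.make_posting_decoder(TERM_FREQ, DOC_BOOST)
        print(type(sim).__name__, doc_id_only is not None, scoring is not None)
```

The point of the sketch is that the caller never asks which Similarity it has.
It asks for a decoder with the attributes it needs, and doc-id-only iteration
succeeds against every Similarity because the empty configuration is always in
the whitelist.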