On Fri, Mar 5, 2010 at 1:54 PM, Marvin Humphrey <mar...@rectangular.com> wrote:
> On Thu, Mar 04, 2010 at 12:23:38PM -0500, Michael McCandless wrote:
>> > In a multi-node search cluster, pre-calculating norms at index-time
>> > wouldn't work well without additional communication between nodes to
>> > gather corpus-wide stats. But I suspect the same trick that works
>> > for IDF in large corpuses would work for average field length: it
>> > will tend to be stable over time, so you can update it
>> > infrequently.
>>
>> Right I imagine we'd need to use this trick within a single index,
>> too. Recomputing norms for entire index when only a small new segment
>> was added to the new NRT reader will probably be too costly.
>
> Agreed. But you definitely want corpus-wide stats, because you're not
> guaranteed to have consistently random distribution of field lengths across
> nodes.
>
> Hoss had a good example illustrating why per-node IDF doesn't always work well
> in a cluster: search cluster of news content with nodes divided by year, and
> the top scoring hit for "iphone" is a misspelling from 1997 (because it was an
> extremely rare term on that search node).
>
> Similarly, if you calc field length stats on one node where the "tags" field
> averages 50 tokens and on another node where it averages 5, you're going to
> get screwy results.
>
> Fortunately, beaming field length data around is an easier problem than
> distributed IDF, because with rare exceptions, the number of fields in a
> typical index is minuscule compared to the number of terms.

Right... so how do we control/configure when stats are fully recomputed
corpus wide.... hmmm. Should be fully app controllable.

>> Though one alternative (if you don't mind burning RAM) is to skip
>> casting to norms, ie store the actual field length, and do the
>> divide-by-avg during scoring (though that's a biggish hit to search
>> perf).
>
> I suppose that's theoretically available to a codec if desired, but it
> wouldn't ever be a first choice of mine.

Yeah me neither. The divide-per-hit, and 4X (over Lucene's current
boost RAM requirements = norms) is a killer.

>> >   token_counts: {
>> >     segment: {
>> >       title: 4,
>> >       content: 154,
>> >     },
>> >     all: {
>> >       title: 98342,
>> >       content: 2854213
>> >     }
>> >   }
>> >
>> > (Would that suffice? I don't recall the gory details of BM25.)
>>
>> I think so, though why store all, per segment? Reader can regen on
>> open? (That above json comes from a single segment right?).
>
> You're right, no need to store "all", calculating on the fly is cheap.
>
>> lnu.ltc would need sum(avg(tf)) as well.
>
> Hmm, I was thinking you'd calc that on the fly, but then deriving the average
> means you have to know the number of docs where the field was not null --
> which could be different from maxDoc() for the segment.
>
> I guess you'd want to accumulate that average while building the segment...
> oh wait, ugh, deletions are going to make that really messy. :(
>
> Think about it for a sec, and see if you swing back to the desirability of
> calculation on the fly using maxDoc(), like I just did.

I think we'd store a float (holding avg(tf) that you computed when
inverting that doc, ie, for all unique terms in the doc what's the avg
of their freqs) for every doc, in the index. Then we can regen fully
when needed right? Or maybe we store sum(tf) and #unique terms...
hmm.

Handling docs that did not have the field is a good point... but we
can assign a special value (eg 0.0, or, any negative number say) to
encode that?
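
To make the per-doc idea concrete, here is a minimal sketch in Java. It
assumes a negative sentinel for "doc did not have the field", and it
does not deal with deletions at all. AvgTfStats, NO_FIELD, avgTf and
meanOfAvgTf are names made up for illustration; none of this is
existing Lucene API.

```java
// Hypothetical sketch only: these names do not exist in Lucene.
final class AvgTfStats {
    // Sentinel stored for docs that did not have the field (assumption:
    // any negative value works, since a real avg(tf) is always >= 1).
    static final float NO_FIELD = -1.0f;

    // Computed once while inverting a doc: termFreqs holds the freq of
    // each unique term in the field, so the result is sum(tf) / #unique.
    static float avgTf(int[] termFreqs) {
        if (termFreqs == null || termFreqs.length == 0) {
            return NO_FIELD;
        }
        long sum = 0;
        for (int tf : termFreqs) {
            sum += tf;
        }
        return (float) sum / termFreqs.length;
    }

    // Regenerated on demand from the stored per-doc floats: the mean of
    // avg(tf) over only those docs that actually had the field.
    static double meanOfAvgTf(float[] perDocAvgTf) {
        double sum = 0.0;
        int docsWithField = 0;
        for (float v : perDocAvgTf) {
            if (v < 0.0f) {
                continue; // sentinel: field was absent in this doc
            }
            sum += v;
            docsWithField++;
        }
        return docsWithField == 0 ? 0.0 : sum / docsWithField;
    }
}
```

Because the sentinel docs are skipped, the divisor is the number of
docs that had the field rather than maxDoc(), which is the distinction
raised above.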

Deletions I think across the board will skew stats until they are
reclaimed.

>> >> The norms array will be stored in this per-field sim instance.
>> >
>> > Interesting, but that wasn't where I was thinking of putting them.
>> > Similarity objects need to be sent over the network, don't they? At
>> > least they do in KS. So I think we need a local per-field
>> > PostingsReader object to hold such cached data.
>>
>> OK maybe not stored on them, but, accessible to them. Maybe cached in
>> the SegmentReader.
>
> Well, I think SegmentReader should be as minimal as possible, with most of the
> real action happening down in sub-readers -- so I think the cached norms
> arrays belong in a sub reader. But we're almost on the same page.

I think we're saying the same thing... it's just that SR is not
componentized yet in Lucene. I agree if it were it would be a
component within SR that holds this norms cache.

>> > What do you do when you have to reconcile two posting codecs like this?
>> >
>> >   * doc id, freq, position, part-of-speech identifier
>> >   * doc id, boost
>> >
>> > Do you silently drop all information except doc id?
>>
>> I don't know -- we haven't hit that yet ;) The closest we have is
>> when <doc id> is merged with <doc id,freq,<position+>>, and in that
>> case we drop the freq,<position+>.
>
> OK, I suppose that answers my question. I dislike the notion of silently
> discarding data on merge conflict, as it becomes possible for one bunk
> document to poison an entire index. But then I also dislike the notion of
> inventing new data ex nihilo, as happens when resolving omitNorms. But then I
> think the whole tangled mess is insane.
>
> In any case, so long as there's a resolution policy in place and any
> Similarity or posting format codec can fall back to doc-id-only, you can move
> on past this challenge.

Yes, I agree neither is ideal. But I don't think we can change that
in Lucene today.

>> With flex this'll be up to the codec's merge methods.
>
> With the default being to fall back to doc-id-only and discard data when an
> unknown posting format is encountered, I presume.

It won't encounter an unknown posting format. It's the codec. It knows all posting formats by the time it sees it. (NOTE: too-long line is intentional -- still testing wrapping!).

>> >> > Similarity is where we decode norms right now. In my opinion, it
>> >> > should be the Similarity object from which we specify per-field
>> >> > posting formats.
>> >>
>> >> I agree.
>> >
>> > Great, I'm glad we're on the same page about that.
>>
>> Actually [sorry] I'm no longer so sure I agree!
>>
>> In flex we have a separate Codec class that's responsible for
>> creating the necessary readers/writers. It seems like Similarity is a
>> consumer of these stats, but need not know what format is used to
>> encode them on disk?
>
> It's true that it's possible to separate out Similarity as a consumer.
> However, I'm also thinking about how to make this API as easy to use as
> possible.
>
> One rationale behind the proposed elevation of Similarity is that I'm not a
> fan of the name "Codec". I think it's too generic to use for the class which
> specifies a posting format. "PostingCodec" is better, but might be too long.
> In contrast, "Similarity" is more esoteric than "Codec", and thus conveys more
> information.

Well, Codec is intentionally generic -- currently it "only" serves up
readers & writers for postings, but over time I expect it'll be the
class Lucene uses to get reader/writer for other parts of the index.

> For Lucy, I'm imagining a stripped-down Similarity class compared to current
> Lucene. It would bear the responsibility for setting policy as to how scores
> are calculated (in other words, judging how "similar" a document is to the
> query), but what information it uses to calculate that score would be left
> entirely open. Methods such as tf(), idf(), encodeNorm(), etc. would move to
> a TF/IDF-specific subclass.
> Here's a sampling of possible Similarity
> subclasses:
>
>   * MatchSimilarity          // core
>   * TFIDFSimilarity          // core
>   * LongFieldTFIDFSimilarity // contrib
>   * BM25Similarity           // contrib
>   * PartOfSpeechSimilarity   // contrib
>
> For Lucy, Similarity would be specified as a member of a FieldType object
> within a Schema. No subclassing would be required to spec custom posting
> formats:
>
>   Schema schema = new Schema();
>   FullTextType bm25Type = new FullTextType(new BM25Similarity());
>   schema.specField("content", bm25Type);
>   schema.specField("title", bm25Type);
>   StringType matchType = new StringType(new MatchSimilarity());
>   schema.specField("category", matchType);
>
> Since the Similarity instance is settable rather than generated by a factory
> method, that means it will have to be serialized within the schema JSON file,
> just like analyzers must be.
>
> I think it's important to make choosing a posting format reasonably easy.
> Match-only fields should be accessible to someone learning basic index tuning
> and optimization techniques.
>
> Actually writing posting codecs is totally different. Not many people are
> going to want to do that, though we should make it easy for experts.

I'm a little confused: if I indexed a field with full postings data,
shouldn't I still be allowed to score with match-only scoring?

When a movie is encoded to a file, the codec(s) determine all sorts of
interesting details. Then when you watch the movie you're free to do
whatever you want -- watch as hidef, as normal def, cropped, sound
only, listen to different languages, pick subtitles, etc. How it's
specifically encoded is strongly decoupled from how you use it.

> What's the flex API for specifying a custom posting format?

You implement a Codecs class, which within it knows about any number
of Codec impls that it can retrieve by name.
Here's the default Codecs on flex now:

  class DefaultCodecs extends Codecs {
    DefaultCodecs() {
      register(new StandardCodec());
      register(new IntBlockCodec());
      register(new PreFlexCodec());
      register(new PulsingCodec());
      register(new SepCodec());
    }

    @Override
    public Codec getWriter(SegmentWriteState state) {
      return lookup("Standard");
      //return lookup("Pulsing");
      //return lookup("Sep");
      //return lookup("IntBlock");
    }
  }

getWriter returns the Codec that will write the current segment.
Codec then has a fieldsConsumer method, which returns a FieldsConsumer
that Lucene will send all postings data to. It's a tiered API --
Lucene adds a field and the FieldsConsumer returns a TermsConsumer,
etc.

>> > What's going to be a little tricky is that you can't have just one
>> > Similarity.makePostingDecoder() method. Sometimes you'll want a
>> > match-only decoder. Sometimes you'll want positions. Sometimes
>> > you'll want part-of-speech id. It's more of an interface/roles
>> > situation than a subclass situation.
>>
>> match-only decoder is handled on flex now by asking for the DocsEnum
>> and then while iterating only using the .doc() (even if underlyingly
>> the codec spent effort decoding freq and maybe other things).
>>
>> If you want positions you get a DocsAndPositionsEnum.
>
> Right. But what happens when you want a custom codec to use BM25 weighting
> *and* inline a part-of-speech ID *and* use PFOR?

You'd use the PForCodec, and make an attr that injects POS. Attrs are
not fully working on flex now (eg, they can't serialize/deserialize
themselves... and we may need some way to differentiate "per position"
attrs and "per doc" attrs), but, this is the game plan.

> I think we have to supply a class object or class name when asking for the
> enumerator, like you do with AttributeSource.
>
>   PostingList plist = null;
>   PostingListReader pListReader = segReader.fetch(PostingListReader);
>   if (pListReader != null) {
>     PostingsReader pReader = pListReader.fetch(field);
>     if (pReader != null) {
>       plist = pReader.makePostingList(klass); // e.g. PartOfSpeechPostingList
>     }
>   }

But is plist a "normal" postings iterator (ie, subclasses it) that has
also exposed a dedicated POS API?

In flex you'd get a "normal" DocsAndPositionsEnum, pull the POS attr
up front, and as you're next'ing your way through it, optionally look
up the POS of each position you step through, using the POS attr.

Mike