Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY? Do we have a test that can trip bad Codecs? It may be more than just changing the documentation...
Why would e.g. TermQuery need to write specialized code for these cases? I looked at TermScorer, and its freq() just returns docsEnum.freq(). I think that Similarity may be affected? Which brings the question - how do Similarity impls know what flags the DE was opened with, and shouldn't they be specialized? E.g. TFIDFSimilarity.ExactTFIDFDocScorer uses the freq passed to score() as an index to an array, so clearly it assumes it is >= 0 and also < scoreCache.length. So I wonder what will happen to it when someone's Codec will return a negative value or MAX_INT in case frequencies aren't needed? I do realize that you shouldn't call Similarity with missing information, and TermWeight obtains a DocsEnum with frequencies, so in that regard it is safe. And if you do obtain a DocsEnum with FLAG_NONE, you'd better know what you're doing and don't pass a random freq() to Similarity. I lean towards documenting the spec from above, and ensuring that all Codecs return 1 for DOCS_ONLY. If in the future we'll need to handle the case where someone receives a DocsEnum which it needs to consume, and doesn't know which flags were used to open it, we can always add a getFlags to DE. Shai On Mon, Dec 17, 2012 at 11:30 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Mon, Dec 17, 2012 at 4:02 PM, Shai Erera <ser...@gmail.com> wrote: > > How do these two go together? > > > >> I think for DOCS_ONLY it makes sense that we lie (say freq=1 when we > >> don't know): lots of places would otherwise have to be special cased > >> for when they consume DOCS_ONLY vs DOCS_AND_POSITIONS. > > > > > > and > > > >> I'm also not sure that > >> all codecs return 1 today if the fields was indexed with DOCS_ONLY ... > > > > > > That just makes it even worse right? I.e., we have code today that > relies > > no that behavior, but we're not sure it works w/ all Codecs? > > Sorry, for my last sentence above I think I meant "I'm also not sure > that all codecs return 1 today if you ask for FLAG_NONE". > > > Remember that DocIdSetIterator.nextDoc() was loosely specified? It was > very > > hard to write a decent DISI consumer. Sometimes calling nextDoc() > returned > > MAX_VAL, sometimes -1, sometimes who knows. When we hardened the spec, it > > actually made consumers' life easier, I think? > > Right, locking down the API makes total sense in general. > > > It's ok if we say that for DOCS_ONLY you have to return 1. That's even > 99.9% > > of the time the correct value to return (unless someone adds e.g. the > same > > StringField twice to the document). > > Right. > > > And it's also ok to say that if you passed FLAG_NONE, freq()'s value is > > unspecified. I think it would be wrong to "lie" here .. not sure if the > > consumer always knows how DocsEnum was requested. Not sure if this > happens > > in real life though (consuming a DocsEnum that you didn't obtain > yourself), > > so I'm willing to ignore that case. > > +1: I think FLAG_NONE should remain undefined. I think we have codecs > today that will return 1, 0, the actual doc freq (when the field was > indexed as >= DOCS_AND_FREQS). > > > These two together sound like a reasonable "spec" to me? > > +1 > > So I think just change your javadocs patch to say that FLAG_NONE means > freq is not defined, and if field was indexed as DOCS_ONLY and you > asked for FLAG_FREQS then we promise to lie and say freq=1? > > Mike McCandless > > http://blog.mikemccandless.com > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > >