Here's the patch with test: Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java =================================================================== --- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java (revision 1423774) +++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java (working copy) @@ -19,6 +19,7 @@
import java.io.IOException; +import org.apache.lucene.index.FieldInfo.IndexOptions; import org.apache.lucene.search.DocIdSetIterator; import org.apache.lucene.util.AttributeSource; import org.apache.lucene.util.Bits; // javadocs @@ -47,10 +48,16 @@ protected DocsEnum() { } - /** Returns term frequency in the current document. Do - * not call this before {@link #nextDoc} is first called, - * nor after {@link #nextDoc} returns NO_MORE_DOCS. - **/ + /** + * Returns term frequency in the current document, or 1 if the document was + * indexed with {@link IndexOptions#DOCS_ONLY}. Do not call this before + * {@link #nextDoc} is first called, nor after {@link #nextDoc} returns + * {@link DocIdSetIterator#NO_MORE_DOCS}. + * + * <p> + * <b>NOTE:</b> if the {@link DocsEnum} was obtain with {@link #FLAG_NONE}, + * the result of this method is undefined. + */ public abstract int freq() throws IOException; /** Returns the related attributes. */ Index: lucene/core/src/test/org/apache/lucene/index/TestCodecs.java =================================================================== --- lucene/core/src/test/org/apache/lucene/index/TestCodecs.java (revision 1423774) +++ lucene/core/src/test/org/apache/lucene/index/TestCodecs.java (working copy) @@ -21,6 +21,7 @@ import java.util.Arrays; import java.util.HashSet; import java.util.Iterator; +import java.util.Random; import org.apache.lucene.analysis.MockAnalyzer; import org.apache.lucene.codecs.Codec; @@ -31,7 +32,9 @@ import org.apache.lucene.codecs.TermsConsumer; import org.apache.lucene.codecs.mocksep.MockSepPostingsFormat; import org.apache.lucene.document.Document; +import org.apache.lucene.document.Field.Store; import org.apache.lucene.document.FieldType; +import org.apache.lucene.document.StringField; import org.apache.lucene.document.TextField; import org.apache.lucene.index.FieldInfo.IndexOptions; import org.apache.lucene.search.DocIdSetIterator; @@ -630,4 +633,33 @@ } consumer.close(); } + + public void testDocsOnlyFreq() throws Exception { + // tests that when fields are indexed with DOCS_ONLY, the Codec + // returns 1 in docsEnum.freq() + Directory dir = newDirectory(); + Random random = random(); + IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig( + TEST_VERSION_CURRENT, new MockAnalyzer(random))); + // we don't need many documents to assert this, but don't use one document either + int numDocs = atLeast(random, 50); + for (int i = 0; i < numDocs; i++) { + Document doc = new Document(); + doc.add(new StringField("f", "doc", Store.NO)); + writer.addDocument(doc); + } + writer.close(); + + Term term = new Term("f", new BytesRef("doc")); + DirectoryReader reader = DirectoryReader.open(dir); + for (AtomicReaderContext ctx : reader.leaves()) { + DocsEnum de = ctx.reader().termDocsEnum(term); + while (de.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) { + assertEquals("wrong freq for doc " + de.docID(), 1, de.freq()); + } + } + reader.close(); + + dir.close(); + } } BTW, I don't know if it should be like that or not, but I ran this test with -Dtests.iters=1000 and printed at the end of each iteration Codec.getDefault(). For all iterations it printed the same Codec, but if I ran the test many times (no iteration), it printed different Codecs in each run. I thought that tests.iters should simulate the behavior of running the test many times? And therefore pick a new seed + Codec (among other things) for each iteration? Shai On Tue, Dec 18, 2012 at 1:09 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Tue, Dec 18, 2012 at 4:46 AM, Shai Erera <ser...@gmail.com> wrote: > > Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY? Do > we > > have a test that can trip bad Codecs? > > I'm not sure! We should make a test & fix any failing ones ... > > > It may be more than just changing the documentation... > > Right. > > > Why would e.g. TermQuery need to write specialized code for these cases? > I > > looked at TermScorer, and its freq() just returns docsEnum.freq(). > > I meant if we did not adopt this spec ("freq() will lie and return 1 > when the field was indexed as DOCS_ONLY"), then e.g. TermQuery would > need specialized code. > > > I think that Similarity may be affected? Which brings the question - how > do > > Similarity impls know what flags the DE was opened with, and shouldn't > they > > be specialized? > > E.g. TFIDFSimilarity.ExactTFIDFDocScorer uses the freq passed to score() > as > > an index to an array, so clearly it assumes it is >= 0 and also < > > scoreCache.length. > > So I wonder what will happen to it when someone's Codec will return a > > negative value or MAX_INT in case frequencies aren't needed? > > Well, if you passed FLAGS_NONE when you opened the DE then it's your > responsibility to never call freq() ... ie, don't call freq() and pass > that to the sim. > > > I do realize that you shouldn't call Similarity with missing information, > > and TermWeight obtains a DocsEnum with frequencies, so in that regard it > is > > safe. > > And if you do obtain a DocsEnum with FLAG_NONE, you'd better know what > > you're doing and don't pass a random freq() to Similarity. > > Right. > > > I lean towards documenting the spec from above, and ensuring that all > Codecs > > return 1 for DOCS_ONLY. > > +1 > > So freq() is undefined if you had passed FLAGS_NONE, and we will lie > and say freq=1 (need a test verifying this) if the field was indexed > as DOCS_ONLY. > > > If in the future we'll need to handle the case where someone receives a > > DocsEnum which it needs to consume, and doesn't know which flags were > used > > to open it, we can always add a getFlags to DE. > > Yeah ... > > Mike McCandless > > http://blog.mikemccandless.com > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > >