+1, but can you change "... if the Document was indexed with DOCS_ONLY" to "... if the Field was indexed with DOCS_ONLY"?
Thanks Shai! Mike McCandless http://blog.mikemccandless.com On Wed, Dec 19, 2012 at 3:14 AM, Shai Erera <ser...@gmail.com> wrote: > Here's the patch with test: > > Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java > =================================================================== > --- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java (revision > 1423774) > +++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java (working > copy) > @@ -19,6 +19,7 @@ > > import java.io.IOException; > > +import org.apache.lucene.index.FieldInfo.IndexOptions; > import org.apache.lucene.search.DocIdSetIterator; > import org.apache.lucene.util.AttributeSource; > import org.apache.lucene.util.Bits; // javadocs > @@ -47,10 +48,16 @@ > > protected DocsEnum() { > } > > - /** Returns term frequency in the current document. Do > - * not call this before {@link #nextDoc} is first called, > - * nor after {@link #nextDoc} returns NO_MORE_DOCS. > - **/ > + /** > + * Returns term frequency in the current document, or 1 if the document > was > + * indexed with {@link IndexOptions#DOCS_ONLY}. Do not call this before > + * {@link #nextDoc} is first called, nor after {@link #nextDoc} returns > > + * {@link DocIdSetIterator#NO_MORE_DOCS}. > + * > + * <p> > + * <b>NOTE:</b> if the {@link DocsEnum} was obtain with {@link > #FLAG_NONE}, > + * the result of this method is undefined. > > + */ > public abstract int freq() throws IOException; > > /** Returns the related attributes. */ > Index: lucene/core/src/test/org/apache/lucene/index/TestCodecs.java > =================================================================== > --- lucene/core/src/test/org/apache/lucene/index/TestCodecs.java > (revision 1423774) > +++ lucene/core/src/test/org/apache/lucene/index/TestCodecs.java > (working copy) > @@ -21,6 +21,7 @@ > import java.util.Arrays; > import java.util.HashSet; > import java.util.Iterator; > +import java.util.Random; > > import org.apache.lucene.analysis.MockAnalyzer; > import org.apache.lucene.codecs.Codec; > @@ -31,7 +32,9 @@ > import org.apache.lucene.codecs.TermsConsumer; > import org.apache.lucene.codecs.mocksep.MockSepPostingsFormat; > import org.apache.lucene.document.Document; > +import org.apache.lucene.document.Field.Store; > import org.apache.lucene.document.FieldType; > +import org.apache.lucene.document.StringField; > import org.apache.lucene.document.TextField; > import org.apache.lucene.index.FieldInfo.IndexOptions; > import org.apache.lucene.search.DocIdSetIterator; > @@ -630,4 +633,33 @@ > } > consumer.close(); > } > + > + public void testDocsOnlyFreq() throws Exception { > + // tests that when fields are indexed with DOCS_ONLY, the Codec > + // returns 1 in docsEnum.freq() > + Directory dir = newDirectory(); > + Random random = random(); > + IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig( > + TEST_VERSION_CURRENT, new MockAnalyzer(random))); > + // we don't need many documents to assert this, but don't use one > document either > + int numDocs = atLeast(random, 50); > + for (int i = 0; i < numDocs; i++) { > + Document doc = new Document(); > + doc.add(new StringField("f", "doc", Store.NO)); > + writer.addDocument(doc); > + } > + writer.close(); > + > + Term term = new Term("f", new BytesRef("doc")); > + DirectoryReader reader = DirectoryReader.open(dir); > + for (AtomicReaderContext ctx : reader.leaves()) { > + DocsEnum de = ctx.reader().termDocsEnum(term); > + while (de.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) { > + assertEquals("wrong freq for doc " + de.docID(), 1, de.freq()); > + } > + } > + reader.close(); > + > + dir.close(); > + } > } > > BTW, I don't know if it should be like that or not, but I ran this test with > -Dtests.iters=1000 and printed at the end of each iteration > Codec.getDefault(). > For all iterations it printed the same Codec, but if I ran the test many > times (no iteration), it printed different Codecs in each run. > I thought that tests.iters should simulate the behavior of running the test > many times? And therefore pick a new seed + Codec (among other things) for > each iteration? > > Shai > > > On Tue, Dec 18, 2012 at 1:09 PM, Michael McCandless > <luc...@mikemccandless.com> wrote: >> >> On Tue, Dec 18, 2012 at 4:46 AM, Shai Erera <ser...@gmail.com> wrote: >> > Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY? Do >> > we >> > have a test that can trip bad Codecs? >> >> I'm not sure! We should make a test & fix any failing ones ... >> >> > It may be more than just changing the documentation... >> >> Right. >> >> > Why would e.g. TermQuery need to write specialized code for these cases? >> > I >> > looked at TermScorer, and its freq() just returns docsEnum.freq(). >> >> I meant if we did not adopt this spec ("freq() will lie and return 1 >> when the field was indexed as DOCS_ONLY"), then e.g. TermQuery would >> need specialized code. >> >> > I think that Similarity may be affected? Which brings the question - how >> > do >> > Similarity impls know what flags the DE was opened with, and shouldn't >> > they >> > be specialized? >> > E.g. TFIDFSimilarity.ExactTFIDFDocScorer uses the freq passed to score() >> > as >> > an index to an array, so clearly it assumes it is >= 0 and also < >> > scoreCache.length. >> > So I wonder what will happen to it when someone's Codec will return a >> > negative value or MAX_INT in case frequencies aren't needed? >> >> Well, if you passed FLAGS_NONE when you opened the DE then it's your >> responsibility to never call freq() ... ie, don't call freq() and pass >> that to the sim. >> >> > I do realize that you shouldn't call Similarity with missing >> > information, >> > and TermWeight obtains a DocsEnum with frequencies, so in that regard it >> > is >> > safe. >> > And if you do obtain a DocsEnum with FLAG_NONE, you'd better know what >> > you're doing and don't pass a random freq() to Similarity. >> >> Right. >> >> > I lean towards documenting the spec from above, and ensuring that all >> > Codecs >> > return 1 for DOCS_ONLY. >> >> +1 >> >> So freq() is undefined if you had passed FLAGS_NONE, and we will lie >> and say freq=1 (need a test verifying this) if the field was indexed >> as DOCS_ONLY. >> >> > If in the future we'll need to handle the case where someone receives a >> > DocsEnum which it needs to consume, and doesn't know which flags were >> > used >> > to open it, we can always add a getFlags to DE. >> >> Yeah ... >> >> Mike McCandless >> >> http://blog.mikemccandless.com >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> For additional commands, e-mail: dev-h...@lucene.apache.org >> > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org