Re: DocsEnum.freq()

Shai Erera Wed, 19 Dec 2012 00:15:27 -0800

Here's the patch with test:

Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java
===================================================================
--- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (revision 1423774)
+++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (working copy)
@@ -19,6 +19,7 @@


 import java.io.IOException;

+import org.apache.lucene.index.FieldInfo.IndexOptions;
 import org.apache.lucene.search.DocIdSetIterator;
 import org.apache.lucene.util.AttributeSource;
 import org.apache.lucene.util.Bits; // javadocs
@@ -47,10 +48,16 @@
   protected DocsEnum() {
   }

-  /** Returns term frequency in the current document.  Do
-   *  not call this before {@link #nextDoc} is first called,
-   *  nor after {@link #nextDoc} returns NO_MORE_DOCS.
-   **/
+  /**
+   * Returns term frequency in the current document, or 1 if the field was
+   * indexed with {@link IndexOptions#DOCS_ONLY}. Do not call this before
+   * {@link #nextDoc} is first called, nor after {@link #nextDoc} returns
+   * {@link DocIdSetIterator#NO_MORE_DOCS}.
+   *
+   * <p>
+   * <b>NOTE:</b> if the {@link DocsEnum} was obtained with {@link #FLAG_NONE},
+   * the result of this method is undefined.
+   */
   public abstract int freq() throws IOException;

   /** Returns the related attributes. */
Index: lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
===================================================================
--- lucene/core/src/test/org/apache/lucene/index/TestCodecs.java  (revision 1423774)
+++ lucene/core/src/test/org/apache/lucene/index/TestCodecs.java  (working copy)
@@ -21,6 +21,7 @@
 import java.util.Arrays;
 import java.util.HashSet;
 import java.util.Iterator;
+import java.util.Random;

 import org.apache.lucene.analysis.MockAnalyzer;
 import org.apache.lucene.codecs.Codec;
@@ -31,7 +32,9 @@
 import org.apache.lucene.codecs.TermsConsumer;
 import org.apache.lucene.codecs.mocksep.MockSepPostingsFormat;
 import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field.Store;
 import org.apache.lucene.document.FieldType;
+import org.apache.lucene.document.StringField;
 import org.apache.lucene.document.TextField;
 import org.apache.lucene.index.FieldInfo.IndexOptions;
 import org.apache.lucene.search.DocIdSetIterator;
@@ -630,4 +633,33 @@
     }
     consumer.close();
   }
+
+  public void testDocsOnlyFreq() throws Exception {
+    // tests that when fields are indexed with DOCS_ONLY, the Codec
+    // returns 1 in docsEnum.freq()
+    Directory dir = newDirectory();
+    Random random = random();
+    IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig(
+        TEST_VERSION_CURRENT, new MockAnalyzer(random)));
+    // we don't need many documents to assert this, but don't use one document either
+    int numDocs = atLeast(random, 50);
+    for (int i = 0; i < numDocs; i++) {
+      Document doc = new Document();
+      doc.add(new StringField("f", "doc", Store.NO));
+      writer.addDocument(doc);
+    }
+    writer.close();
+
+    Term term = new Term("f", new BytesRef("doc"));
+    DirectoryReader reader = DirectoryReader.open(dir);
+    for (AtomicReaderContext ctx : reader.leaves()) {
+      DocsEnum de = ctx.reader().termDocsEnum(term);
+      while (de.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
+        assertEquals("wrong freq for doc " + de.docID(), 1, de.freq());
+      }
+    }
+    reader.close();
+
+    dir.close();
+  }
 }

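To make the intended contract concrete, here is a minimal, self-contained
sketch. It is NOT Lucene code: FreqContractSketch, IndexOpts, PostingsSketch
and totalFreq are hypothetical names made up for illustration, and throwing
when the enum is not positioned on a document is a choice of the sketch (the
Javadoc above only says not to call freq() then). It only models the rule
"freq() reports 1 when the field was indexed as DOCS_ONLY":

```java
/** Illustrative model only; none of these types are Lucene classes. */
final class FreqContractSketch {

  /** Hypothetical stand-in for the two index options discussed here. */
  enum IndexOpts { DOCS_ONLY, DOCS_AND_FREQS }

  /** Hypothetical, simplified postings iterator for a single term. */
  static final class PostingsSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final IndexOpts options;
    private final int[] docs;
    private final int[] freqs; // never read when options == DOCS_ONLY
    private int pos = -1;      // -1 means nextDoc() has not been called yet

    PostingsSketch(IndexOpts options, int[] docs, int[] freqs) {
      this.options = options;
      this.docs = docs;
      this.freqs = freqs;
    }

    /** Advances to the next document, or returns NO_MORE_DOCS when exhausted. */
    int nextDoc() {
      if (pos < docs.length) {
        pos++;
      }
      return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    /** Reports 1 for DOCS_ONLY, otherwise the stored frequency. */
    int freq() {
      if (pos < 0 || pos >= docs.length) {
        // Assumption of this sketch: fail loudly instead of returning garbage.
        throw new IllegalStateException("not positioned on a document");
      }
      return options == IndexOpts.DOCS_ONLY ? 1 : freqs[pos];
    }
  }

  /** Sums freq() over all documents, the way a simple consumer might. */
  static int totalFreq(PostingsSketch de) {
    int total = 0;
    while (de.nextDoc() != PostingsSketch.NO_MORE_DOCS) {
      total += de.freq();
    }
    return total;
  }

  public static void main(String[] args) {
    PostingsSketch docsOnly =
        new PostingsSketch(IndexOpts.DOCS_ONLY, new int[] {0, 3, 7}, null);
    PostingsSketch withFreqs =
        new PostingsSketch(IndexOpts.DOCS_AND_FREQS, new int[] {0, 3, 7}, new int[] {2, 5, 1});
    System.out.println("DOCS_ONLY total freq: " + totalFreq(docsOnly));
    System.out.println("DOCS_AND_FREQS total freq: " + totalFreq(withFreqs));
  }
}
```

With this rule a consumer such as totalFreq() needs no special case for
DOCS_ONLY fields, which is the point made about TermQuery in the quoted
discussion below.
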
BTW, I'm not sure whether this is intended, but I ran this test with
-Dtests.iters=1000 and printed Codec.getDefault() at the end of each
iteration. Every iteration printed the same Codec, whereas running the test
many separate times (without tests.iters) printed a different Codec on each
run. I thought tests.iters was supposed to simulate running the test many
times, and would therefore pick a new seed and Codec (among other things)
for each iteration?

Shai


On Tue, Dec 18, 2012 at 1:09 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> On Tue, Dec 18, 2012 at 4:46 AM, Shai Erera <ser...@gmail.com> wrote:
> > Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY? Do we
> > have a test that can trip bad Codecs?
>
> I'm not sure!  We should make a test & fix any failing ones ...
>
> > It may be more than just changing the documentation...
>
> Right.
>
> > Why would e.g. TermQuery need to write specialized code for these cases? I
> > looked at TermScorer, and its freq() just returns docsEnum.freq().
>
> I meant if we did not adopt this spec ("freq() will lie and return 1
> when the field was indexed as DOCS_ONLY"), then e.g. TermQuery would
> need specialized code.
>
> > I think that Similarity may be affected? Which brings the question - how do
> > Similarity impls know what flags the DE was opened with, and shouldn't they
> > be specialized?
> > E.g. TFIDFSimilarity.ExactTFIDFDocScorer uses the freq passed to score() as
> > an index to an array, so clearly it assumes it is >= 0 and also <
> > scoreCache.length.
> > So I wonder what will happen to it when someone's Codec will return a
> > negative value or MAX_INT in case frequencies aren't needed?
>
> Well, if you passed FLAG_NONE when you opened the DE then it's your
> responsibility to never call freq() ... ie, don't call freq() and pass
> that to the sim.
>
> > I do realize that you shouldn't call Similarity with missing information,
> > and TermWeight obtains a DocsEnum with frequencies, so in that regard it is
> > safe.
> > And if you do obtain a DocsEnum with FLAG_NONE, you'd better know what
> > you're doing and don't pass a random freq() to Similarity.
>
> Right.
>
> > I lean towards documenting the spec from above, and ensuring that all Codecs
> > return 1 for DOCS_ONLY.
>
> +1
>
> So freq() is undefined if you had passed FLAG_NONE, and we will lie
> and say freq=1 (need a test verifying this) if the field was indexed
> as DOCS_ONLY.
>
> > If in the future we'll need to handle the case where someone receives a
> > DocsEnum which it needs to consume, and doesn't know which flags were used
> > to open it, we can always add a getFlags to DE.
>
> Yeah ...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
