Re: DocsEnum.freq()

Shai Erera Wed, 19 Dec 2012 04:42:57 -0800

Thanks for catching that Mike. Will change and commit.

Shai



On Wed, Dec 19, 2012 at 2:38 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> +1, but can you change "... if the Document was indexed with
> DOCS_ONLY" to "... if the Field was indexed with DOCS_ONLY"?
>
> Thanks Shai!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Wed, Dec 19, 2012 at 3:14 AM, Shai Erera <ser...@gmail.com> wrote:
> > Here's the patch with test:
> >
> > Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java
> > ===================================================================
> > --- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (revision
> > 1423774)
> > +++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (working
> > copy)
> > @@ -19,6 +19,7 @@
> >
> >  import java.io.IOException;
> >
> > +import org.apache.lucene.index.FieldInfo.IndexOptions;
> >  import org.apache.lucene.search.DocIdSetIterator;
> >  import org.apache.lucene.util.AttributeSource;
> >  import org.apache.lucene.util.Bits; // javadocs
> > @@ -47,10 +48,16 @@
> >
> >    protected DocsEnum() {
> >    }
> >
> > -  /** Returns term frequency in the current document.  Do
> > -   *  not call this before {@link #nextDoc} is first called,
> > -   *  nor after {@link #nextDoc} returns NO_MORE_DOCS.
> > -   **/
> > +  /**
> > +   * Returns term frequency in the current document, or 1 if the
> document
> > was
> > +   * indexed with {@link IndexOptions#DOCS_ONLY}. Do not call this
> before
> > +   * {@link #nextDoc} is first called, nor after {@link #nextDoc}
> returns
> >
> > +   * {@link DocIdSetIterator#NO_MORE_DOCS}.
> > +   *
> > +   * <p>
> > +   * <b>NOTE:</b> if the {@link DocsEnum} was obtain with {@link
> > #FLAG_NONE},
> > +   * the result of this method is undefined.
> >
> > +   */
> >    public abstract int freq() throws IOException;
> >
> >    /** Returns the related attributes. */
> > Index: lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
> > ===================================================================
> > --- lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
> > (revision 1423774)
> > +++ lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
> > (working copy)
> > @@ -21,6 +21,7 @@
> >  import java.util.Arrays;
> >  import java.util.HashSet;
> >  import java.util.Iterator;
> > +import java.util.Random;
> >
> >  import org.apache.lucene.analysis.MockAnalyzer;
> >  import org.apache.lucene.codecs.Codec;
> > @@ -31,7 +32,9 @@
> >  import org.apache.lucene.codecs.TermsConsumer;
> >  import org.apache.lucene.codecs.mocksep.MockSepPostingsFormat;
> >  import org.apache.lucene.document.Document;
> > +import org.apache.lucene.document.Field.Store;
> >  import org.apache.lucene.document.FieldType;
> > +import org.apache.lucene.document.StringField;
> >  import org.apache.lucene.document.TextField;
> >  import org.apache.lucene.index.FieldInfo.IndexOptions;
> >  import org.apache.lucene.search.DocIdSetIterator;
> > @@ -630,4 +633,33 @@
> >      }
> >      consumer.close();
> >    }
> > +
> > +  public void testDocsOnlyFreq() throws Exception {
> > +    // tests that when fields are indexed with DOCS_ONLY, the Codec
> > +    // returns 1 in docsEnum.freq()
> > +    Directory dir = newDirectory();
> > +    Random random = random();
> > +    IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig(
> > +        TEST_VERSION_CURRENT, new MockAnalyzer(random)));
> > +    // we don't need many documents to assert this, but don't use one
> > document either
> > +    int numDocs = atLeast(random, 50);
> > +    for (int i = 0; i < numDocs; i++) {
> > +      Document doc = new Document();
> > +      doc.add(new StringField("f", "doc", Store.NO));
> > +      writer.addDocument(doc);
> > +    }
> > +    writer.close();
> > +
> > +    Term term = new Term("f", new BytesRef("doc"));
> > +    DirectoryReader reader = DirectoryReader.open(dir);
> > +    for (AtomicReaderContext ctx : reader.leaves()) {
> > +      DocsEnum de = ctx.reader().termDocsEnum(term);
> > +      while (de.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
> > +        assertEquals("wrong freq for doc " + de.docID(), 1, de.freq());
> > +      }
> > +    }
> > +    reader.close();
> > +
> > +    dir.close();
> > +  }
> >  }
> >
> > BTW, I don't know if it should be like that or not, but I ran this test
> with
> > -Dtests.iters=1000 and printed at the end of each iteration
> > Codec.getDefault().
> > For all iterations it printed the same Codec, but if I ran the test many
> > times (no iteration), it printed different Codecs in each run.
> > I thought that tests.iters should simulate the behavior of running the
> test
> > many times? And therefore pick a new seed + Codec (among other things)
> for
> > each iteration?
> >
> > Shai
> >
> >
> > On Tue, Dec 18, 2012 at 1:09 PM, Michael McCandless
> > <luc...@mikemccandless.com> wrote:
> >>
> >> On Tue, Dec 18, 2012 at 4:46 AM, Shai Erera <ser...@gmail.com> wrote:
> >> > Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY?
> Do
> >> > we
> >> > have a test that can trip bad Codecs?
> >>
> >> I'm not sure!  We should make a test & fix any failing ones ...
> >>
> >> > It may be more than just changing the documentation...
> >>
> >> Right.
> >>
> >> > Why would e.g. TermQuery need to write specialized code for these
> cases?
> >> > I
> >> > looked at TermScorer, and its freq() just returns docsEnum.freq().
> >>
> >> I meant if we did not adopt this spec ("freq() will lie and return 1
> >> when the field was indexed as DOCS_ONLY"), then e.g. TermQuery would
> >> need specialized code.
> >>
> >> > I think that Similarity may be affected? Which brings the question -
> how
> >> > do
> >> > Similarity impls know what flags the DE was opened with, and shouldn't
> >> > they
> >> > be specialized?
> >> > E.g. TFIDFSimilarity.ExactTFIDFDocScorer uses the freq passed to
> score()
> >> > as
> >> > an index to an array, so clearly it assumes it is >= 0 and also <
> >> > scoreCache.length.
> >> > So I wonder what will happen to it when someone's Codec will return a
> >> > negative value or MAX_INT in case frequencies aren't needed?
> >>
> >> Well, if you passed FLAGS_NONE when you opened the DE then it's your
> >> responsibility to never call freq() ... ie, don't call freq() and pass
> >> that to the sim.
> >>
> >> > I do realize that you shouldn't call Similarity with missing
> >> > information,
> >> > and TermWeight obtains a DocsEnum with frequencies, so in that regard
> it
> >> > is
> >> > safe.
> >> > And if you do obtain a DocsEnum with FLAG_NONE, you'd better know what
> >> > you're doing and don't pass a random freq() to Similarity.
> >>
> >> Right.
> >>
> >> > I lean towards documenting the spec from above, and ensuring that all
> >> > Codecs
> >> > return 1 for DOCS_ONLY.
> >>
> >> +1
> >>
> >> So freq() is undefined if you had passed FLAGS_NONE, and we will lie
> >> and say freq=1 (need a test verifying this) if the field was indexed
> >> as DOCS_ONLY.
> >>
> >> > If in the future we'll need to handle the case where someone receives
> a
> >> > DocsEnum which it needs to consume, and doesn't know which flags were
> >> > used
> >> > to open it, we can always add a getFlags to DE.
> >>
> >> Yeah ...
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>

Re: DocsEnum.freq()

Reply via email to