Re: DocsEnum.freq()

Robert Muir Wed, 19 Dec 2012 06:55:42 -0800

+1

On Wed, Dec 19, 2012 at 7:42 AM, Shai Erera <ser...@gmail.com> wrote:
> Thanks for catching that Mike. Will change and commit.
>
> Shai
>
>
> On Wed, Dec 19, 2012 at 2:38 PM, Michael McCandless
> <luc...@mikemccandless.com> wrote:
>>
>> +1, but can you change "... if the Document was indexed with
>> DOCS_ONLY" to "... if the Field was indexed with DOCS_ONLY"?
>>
>> Thanks Shai!
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Wed, Dec 19, 2012 at 3:14 AM, Shai Erera <ser...@gmail.com> wrote:
>> > Here's the patch with test:
>> >
>> > Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java
>> > ===================================================================
>> > --- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java
>> > (revision
>> > 1423774)
>> > +++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (working
>> > copy)
>> > @@ -19,6 +19,7 @@
>> >
>> >  import java.io.IOException;
>> >
>> > +import org.apache.lucene.index.FieldInfo.IndexOptions;
>> >  import org.apache.lucene.search.DocIdSetIterator;
>> >  import org.apache.lucene.util.AttributeSource;
>> >  import org.apache.lucene.util.Bits; // javadocs
>> > @@ -47,10 +48,16 @@
>> >
>> >    protected DocsEnum() {
>> >    }
>> >
>> > -  /** Returns term frequency in the current document.  Do
>> > -   *  not call this before {@link #nextDoc} is first called,
>> > -   *  nor after {@link #nextDoc} returns NO_MORE_DOCS.
>> > -   **/
>> > +  /**
>> > +   * Returns term frequency in the current document, or 1 if the
>> > document
>> > was
>> > +   * indexed with {@link IndexOptions#DOCS_ONLY}. Do not call this
>> > before
>> > +   * {@link #nextDoc} is first called, nor after {@link #nextDoc}
>> > returns
>> >
>> > +   * {@link DocIdSetIterator#NO_MORE_DOCS}.
>> > +   *
>> > +   * <p>
>> > +   * <b>NOTE:</b> if the {@link DocsEnum} was obtain with {@link
>> > #FLAG_NONE},
>> > +   * the result of this method is undefined.
>> >
>> > +   */
>> >    public abstract int freq() throws IOException;
>> >
>> >    /** Returns the related attributes. */
>> > Index: lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
>> > ===================================================================
>> > --- lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
>> > (revision 1423774)
>> > +++ lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
>> > (working copy)
>> > @@ -21,6 +21,7 @@
>> >  import java.util.Arrays;
>> >  import java.util.HashSet;
>> >  import java.util.Iterator;
>> > +import java.util.Random;
>> >
>> >  import org.apache.lucene.analysis.MockAnalyzer;
>> >  import org.apache.lucene.codecs.Codec;
>> > @@ -31,7 +32,9 @@
>> >  import org.apache.lucene.codecs.TermsConsumer;
>> >  import org.apache.lucene.codecs.mocksep.MockSepPostingsFormat;
>> >  import org.apache.lucene.document.Document;
>> > +import org.apache.lucene.document.Field.Store;
>> >  import org.apache.lucene.document.FieldType;
>> > +import org.apache.lucene.document.StringField;
>> >  import org.apache.lucene.document.TextField;
>> >  import org.apache.lucene.index.FieldInfo.IndexOptions;
>> >  import org.apache.lucene.search.DocIdSetIterator;
>> > @@ -630,4 +633,33 @@
>> >      }
>> >      consumer.close();
>> >    }
>> > +
>> > +  public void testDocsOnlyFreq() throws Exception {
>> > +    // tests that when fields are indexed with DOCS_ONLY, the Codec
>> > +    // returns 1 in docsEnum.freq()
>> > +    Directory dir = newDirectory();
>> > +    Random random = random();
>> > +    IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig(
>> > +        TEST_VERSION_CURRENT, new MockAnalyzer(random)));
>> > +    // we don't need many documents to assert this, but don't use one
>> > document either
>> > +    int numDocs = atLeast(random, 50);
>> > +    for (int i = 0; i < numDocs; i++) {
>> > +      Document doc = new Document();
>> > +      doc.add(new StringField("f", "doc", Store.NO));
>> > +      writer.addDocument(doc);
>> > +    }
>> > +    writer.close();
>> > +
>> > +    Term term = new Term("f", new BytesRef("doc"));
>> > +    DirectoryReader reader = DirectoryReader.open(dir);
>> > +    for (AtomicReaderContext ctx : reader.leaves()) {
>> > +      DocsEnum de = ctx.reader().termDocsEnum(term);
>> > +      while (de.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
>> > +        assertEquals("wrong freq for doc " + de.docID(), 1, de.freq());
>> > +      }
>> > +    }
>> > +    reader.close();
>> > +
>> > +    dir.close();
>> > +  }
>> >  }
>> >
>> > BTW, I don't know if it should be like that or not, but I ran this test
>> > with
>> > -Dtests.iters=1000 and printed at the end of each iteration
>> > Codec.getDefault().
>> > For all iterations it printed the same Codec, but if I ran the test many
>> > times (no iteration), it printed different Codecs in each run.
>> > I thought that tests.iters should simulate the behavior of running the
>> > test
>> > many times? And therefore pick a new seed + Codec (among other things)
>> > for
>> > each iteration?
>> >
>> > Shai
>> >
>> >
>> > On Tue, Dec 18, 2012 at 1:09 PM, Michael McCandless
>> > <luc...@mikemccandless.com> wrote:
>> >>
>> >> On Tue, Dec 18, 2012 at 4:46 AM, Shai Erera <ser...@gmail.com> wrote:
>> >> > Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY?
>> >> > Do
>> >> > we
>> >> > have a test that can trip bad Codecs?
>> >>
>> >> I'm not sure!  We should make a test & fix any failing ones ...
>> >>
>> >> > It may be more than just changing the documentation...
>> >>
>> >> Right.
>> >>
>> >> > Why would e.g. TermQuery need to write specialized code for these
>> >> > cases?
>> >> > I
>> >> > looked at TermScorer, and its freq() just returns docsEnum.freq().
>> >>
>> >> I meant if we did not adopt this spec ("freq() will lie and return 1
>> >> when the field was indexed as DOCS_ONLY"), then e.g. TermQuery would
>> >> need specialized code.
>> >>
>> >> > I think that Similarity may be affected? Which brings the question -
>> >> > how
>> >> > do
>> >> > Similarity impls know what flags the DE was opened with, and
>> >> > shouldn't
>> >> > they
>> >> > be specialized?
>> >> > E.g. TFIDFSimilarity.ExactTFIDFDocScorer uses the freq passed to
>> >> > score()
>> >> > as
>> >> > an index to an array, so clearly it assumes it is >= 0 and also <
>> >> > scoreCache.length.
>> >> > So I wonder what will happen to it when someone's Codec will return a
>> >> > negative value or MAX_INT in case frequencies aren't needed?
>> >>
>> >> Well, if you passed FLAGS_NONE when you opened the DE then it's your
>> >> responsibility to never call freq() ... ie, don't call freq() and pass
>> >> that to the sim.
>> >>
>> >> > I do realize that you shouldn't call Similarity with missing
>> >> > information,
>> >> > and TermWeight obtains a DocsEnum with frequencies, so in that regard
>> >> > it
>> >> > is
>> >> > safe.
>> >> > And if you do obtain a DocsEnum with FLAG_NONE, you'd better know
>> >> > what
>> >> > you're doing and don't pass a random freq() to Similarity.
>> >>
>> >> Right.
>> >>
>> >> > I lean towards documenting the spec from above, and ensuring that all
>> >> > Codecs
>> >> > return 1 for DOCS_ONLY.
>> >>
>> >> +1
>> >>
>> >> So freq() is undefined if you had passed FLAGS_NONE, and we will lie
>> >> and say freq=1 (need a test verifying this) if the field was indexed
>> >> as DOCS_ONLY.
>> >>
>> >> > If in the future we'll need to handle the case where someone receives
>> >> > a
>> >> > DocsEnum which it needs to consume, and doesn't know which flags were
>> >> > used
>> >> > to open it, we can always add a getFlags to DE.
>> >>
>> >> Yeah ...
>> >>
>> >> Mike McCandless
>> >>
>> >> http://blog.mikemccandless.com
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>> >>
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Re: DocsEnum.freq()

Reply via email to