+1, but can you change "... if the Document was indexed with
DOCS_ONLY" to "... if the Field was indexed with DOCS_ONLY"?

Thanks Shai!

Mike McCandless

http://blog.mikemccandless.com

On Wed, Dec 19, 2012 at 3:14 AM, Shai Erera <ser...@gmail.com> wrote:
> Here's the patch with test:
>
> Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java
> ===================================================================
> --- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (revision
> 1423774)
> +++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (working
> copy)
> @@ -19,6 +19,7 @@
>
>  import java.io.IOException;
>
> +import org.apache.lucene.index.FieldInfo.IndexOptions;
>  import org.apache.lucene.search.DocIdSetIterator;
>  import org.apache.lucene.util.AttributeSource;
>  import org.apache.lucene.util.Bits; // javadocs
> @@ -47,10 +48,16 @@
>
>    protected DocsEnum() {
>    }
>
> -  /** Returns term frequency in the current document.  Do
> -   *  not call this before {@link #nextDoc} is first called,
> -   *  nor after {@link #nextDoc} returns NO_MORE_DOCS.
> -   **/
> +  /**
> +   * Returns term frequency in the current document, or 1 if the document
> was
> +   * indexed with {@link IndexOptions#DOCS_ONLY}. Do not call this before
> +   * {@link #nextDoc} is first called, nor after {@link #nextDoc} returns
>
> +   * {@link DocIdSetIterator#NO_MORE_DOCS}.
> +   *
> +   * <p>
> +   * <b>NOTE:</b> if the {@link DocsEnum} was obtain with {@link
> #FLAG_NONE},
> +   * the result of this method is undefined.
>
> +   */
>    public abstract int freq() throws IOException;
>
>    /** Returns the related attributes. */
> Index: lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
> ===================================================================
> --- lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
> (revision 1423774)
> +++ lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
> (working copy)
> @@ -21,6 +21,7 @@
>  import java.util.Arrays;
>  import java.util.HashSet;
>  import java.util.Iterator;
> +import java.util.Random;
>
>  import org.apache.lucene.analysis.MockAnalyzer;
>  import org.apache.lucene.codecs.Codec;
> @@ -31,7 +32,9 @@
>  import org.apache.lucene.codecs.TermsConsumer;
>  import org.apache.lucene.codecs.mocksep.MockSepPostingsFormat;
>  import org.apache.lucene.document.Document;
> +import org.apache.lucene.document.Field.Store;
>  import org.apache.lucene.document.FieldType;
> +import org.apache.lucene.document.StringField;
>  import org.apache.lucene.document.TextField;
>  import org.apache.lucene.index.FieldInfo.IndexOptions;
>  import org.apache.lucene.search.DocIdSetIterator;
> @@ -630,4 +633,33 @@
>      }
>      consumer.close();
>    }
> +
> +  public void testDocsOnlyFreq() throws Exception {
> +    // tests that when fields are indexed with DOCS_ONLY, the Codec
> +    // returns 1 in docsEnum.freq()
> +    Directory dir = newDirectory();
> +    Random random = random();
> +    IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig(
> +        TEST_VERSION_CURRENT, new MockAnalyzer(random)));
> +    // we don't need many documents to assert this, but don't use one
> document either
> +    int numDocs = atLeast(random, 50);
> +    for (int i = 0; i < numDocs; i++) {
> +      Document doc = new Document();
> +      doc.add(new StringField("f", "doc", Store.NO));
> +      writer.addDocument(doc);
> +    }
> +    writer.close();
> +
> +    Term term = new Term("f", new BytesRef("doc"));
> +    DirectoryReader reader = DirectoryReader.open(dir);
> +    for (AtomicReaderContext ctx : reader.leaves()) {
> +      DocsEnum de = ctx.reader().termDocsEnum(term);
> +      while (de.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
> +        assertEquals("wrong freq for doc " + de.docID(), 1, de.freq());
> +      }
> +    }
> +    reader.close();
> +
> +    dir.close();
> +  }
>  }
>
> BTW, I don't know if it should be like that or not, but I ran this test with
> -Dtests.iters=1000 and printed at the end of each iteration
> Codec.getDefault().
> For all iterations it printed the same Codec, but if I ran the test many
> times (no iteration), it printed different Codecs in each run.
> I thought that tests.iters should simulate the behavior of running the test
> many times? And therefore pick a new seed + Codec (among other things) for
> each iteration?
>
> Shai
>
>
> On Tue, Dec 18, 2012 at 1:09 PM, Michael McCandless
> <luc...@mikemccandless.com> wrote:
>>
>> On Tue, Dec 18, 2012 at 4:46 AM, Shai Erera <ser...@gmail.com> wrote:
>> > Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY? Do
>> > we
>> > have a test that can trip bad Codecs?
>>
>> I'm not sure!  We should make a test & fix any failing ones ...
>>
>> > It may be more than just changing the documentation...
>>
>> Right.
>>
>> > Why would e.g. TermQuery need to write specialized code for these cases?
>> > I
>> > looked at TermScorer, and its freq() just returns docsEnum.freq().
>>
>> I meant if we did not adopt this spec ("freq() will lie and return 1
>> when the field was indexed as DOCS_ONLY"), then e.g. TermQuery would
>> need specialized code.
>>
>> > I think that Similarity may be affected? Which brings the question - how
>> > do
>> > Similarity impls know what flags the DE was opened with, and shouldn't
>> > they
>> > be specialized?
>> > E.g. TFIDFSimilarity.ExactTFIDFDocScorer uses the freq passed to score()
>> > as
>> > an index to an array, so clearly it assumes it is >= 0 and also <
>> > scoreCache.length.
>> > So I wonder what will happen to it when someone's Codec will return a
>> > negative value or MAX_INT in case frequencies aren't needed?
>>
>> Well, if you passed FLAGS_NONE when you opened the DE then it's your
>> responsibility to never call freq() ... ie, don't call freq() and pass
>> that to the sim.
>>
>> > I do realize that you shouldn't call Similarity with missing
>> > information,
>> > and TermWeight obtains a DocsEnum with frequencies, so in that regard it
>> > is
>> > safe.
>> > And if you do obtain a DocsEnum with FLAG_NONE, you'd better know what
>> > you're doing and don't pass a random freq() to Similarity.
>>
>> Right.
>>
>> > I lean towards documenting the spec from above, and ensuring that all
>> > Codecs
>> > return 1 for DOCS_ONLY.
>>
>> +1
>>
>> So freq() is undefined if you had passed FLAGS_NONE, and we will lie
>> and say freq=1 (need a test verifying this) if the field was indexed
>> as DOCS_ONLY.
>>
>> > If in the future we'll need to handle the case where someone receives a
>> > DocsEnum which it needs to consume, and doesn't know which flags were
>> > used
>> > to open it, we can always add a getFlags to DE.
>>
>> Yeah ...
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to