[jira] [Created] (LUCENE-5315) Some subclasses of Scorer do not honor the contract of DocsEnum.freq()
Kai Chan created LUCENE-5315: Summary: Some subclasses of Scorer do not honor the contract of DocsEnum.freq() Key: LUCENE-5315 URL: https://issues.apache.org/jira/browse/LUCENE-5315 Project: Lucene - Core Issue Type: Bug Components: core/query/scoring Affects Versions: 4.5.1 Reporter: Kai Chan The behavior of Scorer.freq() is inconsistent across its subclasses: * For TermScorer, the freq() method behaves just as DocsEnum.freq() specifies, i.e. the method returns the term frequency in the current document. * For BooleanScorer2, ConjunctionScorer, DisjunctionMaxScorer, DisjunctionSumScorer, and possibly other classes, the freq() method returns the number of clauses (in BooleanQuery or DisjunctionMaxQuery) that match the current document. This difference makes the meaning of Scorer.freq()'s return value uncertain. To add to the uncertainty, given a Query, there seems to be no way of knowing which behavior takes effect (as that is not specified in the API reference) except by reading or running the code. This issue might be related to LUCENE-4514. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-5315) Some subclasses of Scorer do not honor the contract of DocsEnum.freq()
[ https://issues.apache.org/jira/browse/LUCENE-5315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Chan updated LUCENE-5315: - Attachment: DisjunctionMaxQueryTest.java This is a very simple program that illustrates the problem. I am attaching the output in a separate file. Some subclasses of Scorer do not honor the contract of DocsEnum.freq() -- Key: LUCENE-5315 URL: https://issues.apache.org/jira/browse/LUCENE-5315 Project: Lucene - Core Issue Type: Bug Components: core/query/scoring Affects Versions: 4.5.1 Reporter: Kai Chan Attachments: DisjunctionMaxQueryTest.java The behavior of Scorer.freq() is inconsistent across its subclasses: * For TermScorer, the freq() method behaves just as DocsEnum.freq() specifies, i.e. the method returns the term frequency in the current document. * For BooleanScorer2, ConjunctionScorer, DisjunctionMaxScorer, DisjunctionSumScorer, and possibly other classes, the freq() method returns the number of clauses (in BooleanQuery or DisjunctionMaxQuery) that match the current document. This difference makes the meaning of Scorer.freq()'s return value uncertain. To add to the uncertainty, given a Query, there seems to be no way of knowing which behavior takes effect (as that is not specified in the API reference) except by reading or running the code. This issue might be related to LUCENE-4514. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-5315) Some subclasses of Scorer do not honor the contract of DocsEnum.freq()
[ https://issues.apache.org/jira/browse/LUCENE-5315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Chan updated LUCENE-5315: - Attachment: DisjunctionMaxQueryTest.java.output Here is the output of DisjunctionMaxQueryTest. For the freq() method, the return values (3, 4, 5) from TermScorer are term frequencies, while the return values (1, 2) from ConjunctionScorer and others are clause counts. Some subclasses of Scorer do not honor the contract of DocsEnum.freq() -- Key: LUCENE-5315 URL: https://issues.apache.org/jira/browse/LUCENE-5315 Project: Lucene - Core Issue Type: Bug Components: core/query/scoring Affects Versions: 4.5.1 Reporter: Kai Chan Attachments: DisjunctionMaxQueryTest.java, DisjunctionMaxQueryTest.java.output The behavior of Scorer.freq() is inconsistent across its subclasses: * For TermScorer, the freq() method behaves just as DocsEnum.freq() specifies, i.e. the method returns the term frequency in the current document. * For BooleanScorer2, ConjunctionScorer, DisjunctionMaxScorer, DisjunctionSumScorer, and possibly other classes, the freq() method returns the number of clauses (in BooleanQuery or DisjunctionMaxQuery) that match the current document. This difference makes the meaning of Scorer.freq()'s return value uncertain. To add to the uncertainty, given a Query, there seems to be no way of knowing which behavior takes effect (as that is not specified in the API reference) except by reading or running the code. This issue might be related to LUCENE-4514. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5315) Some subclasses of Scorer do not honor the contract of DocsEnum.freq()
[ https://issues.apache.org/jira/browse/LUCENE-5315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808586#comment-13808586 ] Robert Muir commented on LUCENE-5315: - I dont see the bug, as it said in LUCENE-4514, its the number of matches in the document for that scorer. So if its a Termscorer its the number of occurrences of that term. if its a phrasescorer its the number of occurrences of that phrase. If its a booleanscorer, its the number of occurrences of that boolean conjunct. It does not try to 'sum' or anything. if you want to that, you can easily do it since you can navigate subscorers with getChildren. Some subclasses of Scorer do not honor the contract of DocsEnum.freq() -- Key: LUCENE-5315 URL: https://issues.apache.org/jira/browse/LUCENE-5315 Project: Lucene - Core Issue Type: Bug Components: core/query/scoring Affects Versions: 4.5.1 Reporter: Kai Chan Attachments: DisjunctionMaxQueryTest.java, DisjunctionMaxQueryTest.java.output The behavior of Scorer.freq() is inconsistent across its subclasses: * For TermScorer, the freq() method behaves just as DocsEnum.freq() specifies, i.e. the method returns the term frequency in the current document. * For BooleanScorer2, ConjunctionScorer, DisjunctionMaxScorer, DisjunctionSumScorer, and possibly other classes, the freq() method returns the number of clauses (in BooleanQuery or DisjunctionMaxQuery) that match the current document. This difference makes the meaning of Scorer.freq()'s return value uncertain. To add to the uncertainty, given a Query, there seems to be no way of knowing which behavior takes effect (as that is not specified in the API reference) except by reading or running the code. This issue might be related to LUCENE-4514. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5315) Some subclasses of Scorer do not honor the contract of DocsEnum.freq()
[ https://issues.apache.org/jira/browse/LUCENE-5315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808607#comment-13808607 ] Kai Chan commented on LUCENE-5315: -- This is how I see it. I could be wrong though, and please feel free to point out if it is the case. # The API reference of DocsEnum.freq() states that its return value is term frequency in the current document, or 1 if the field was indexed with FieldInfo.IndexOptions.DOCS_ONLY. (I do not think the second part applies here though, just the first part.) # Scorer and its subclasses extend DocEnum, and therefore their freq() methods override DocsEnum.freq(). # Because of (1) and (2), I thought the freq() method of Scorer and its subclasses should also return term frequency in the current document, to honor the contract set by DocsEnum.freq(). # However, for some of Scorer's subclasses, (3) is not how the classes are currently implemented. Some subclasses of Scorer do not honor the contract of DocsEnum.freq() -- Key: LUCENE-5315 URL: https://issues.apache.org/jira/browse/LUCENE-5315 Project: Lucene - Core Issue Type: Bug Components: core/query/scoring Affects Versions: 4.5.1 Reporter: Kai Chan Attachments: DisjunctionMaxQueryTest.java, DisjunctionMaxQueryTest.java.output The behavior of Scorer.freq() is inconsistent across its subclasses: * For TermScorer, the freq() method behaves just as DocsEnum.freq() specifies, i.e. the method returns the term frequency in the current document. * For BooleanScorer2, ConjunctionScorer, DisjunctionMaxScorer, DisjunctionSumScorer, and possibly other classes, the freq() method returns the number of clauses (in BooleanQuery or DisjunctionMaxQuery) that match the current document. This difference makes the meaning of Scorer.freq()'s return value uncertain. To add to the uncertainty, given a Query, there seems to be no way of knowing which behavior takes effect (as that is not specified in the API reference) except by reading or running the code. This issue might be related to LUCENE-4514. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5315) Some subclasses of Scorer do not honor the contract of DocsEnum.freq()
[ https://issues.apache.org/jira/browse/LUCENE-5315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808615#comment-13808615 ] Robert Muir commented on LUCENE-5315: - i think DocsEnum.freq() should be changed to say matches. According to your strict definition, PhraseScorer is broken too, because it doesnt return term frequencies :) Some subclasses of Scorer do not honor the contract of DocsEnum.freq() -- Key: LUCENE-5315 URL: https://issues.apache.org/jira/browse/LUCENE-5315 Project: Lucene - Core Issue Type: Bug Components: core/query/scoring Affects Versions: 4.5.1 Reporter: Kai Chan Attachments: DisjunctionMaxQueryTest.java, DisjunctionMaxQueryTest.java.output The behavior of Scorer.freq() is inconsistent across its subclasses: * For TermScorer, the freq() method behaves just as DocsEnum.freq() specifies, i.e. the method returns the term frequency in the current document. * For BooleanScorer2, ConjunctionScorer, DisjunctionMaxScorer, DisjunctionSumScorer, and possibly other classes, the freq() method returns the number of clauses (in BooleanQuery or DisjunctionMaxQuery) that match the current document. This difference makes the meaning of Scorer.freq()'s return value uncertain. To add to the uncertainty, given a Query, there seems to be no way of knowing which behavior takes effect (as that is not specified in the API reference) except by reading or running the code. This issue might be related to LUCENE-4514. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: DocsEnum.freq()
Here's the patch with test: Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java === --- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java (revision 1423774) +++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java (working copy) @@ -19,6 +19,7 @@ import java.io.IOException; +import org.apache.lucene.index.FieldInfo.IndexOptions; import org.apache.lucene.search.DocIdSetIterator; import org.apache.lucene.util.AttributeSource; import org.apache.lucene.util.Bits; // javadocs @@ -47,10 +48,16 @@ protected DocsEnum() { } - /** Returns term frequency in the current document. Do - * not call this before {@link #nextDoc} is first called, - * nor after {@link #nextDoc} returns NO_MORE_DOCS. - **/ + /** + * Returns term frequency in the current document, or 1 if the document was + * indexed with {@link IndexOptions#DOCS_ONLY}. Do not call this before + * {@link #nextDoc} is first called, nor after {@link #nextDoc} returns + * {@link DocIdSetIterator#NO_MORE_DOCS}. + * + * p + * bNOTE:/b if the {@link DocsEnum} was obtain with {@link #FLAG_NONE}, + * the result of this method is undefined. + */ public abstract int freq() throws IOException; /** Returns the related attributes. */ Index: lucene/core/src/test/org/apache/lucene/index/TestCodecs.java === --- lucene/core/src/test/org/apache/lucene/index/TestCodecs.java (revision 1423774) +++ lucene/core/src/test/org/apache/lucene/index/TestCodecs.java (working copy) @@ -21,6 +21,7 @@ import java.util.Arrays; import java.util.HashSet; import java.util.Iterator; +import java.util.Random; import org.apache.lucene.analysis.MockAnalyzer; import org.apache.lucene.codecs.Codec; @@ -31,7 +32,9 @@ import org.apache.lucene.codecs.TermsConsumer; import org.apache.lucene.codecs.mocksep.MockSepPostingsFormat; import org.apache.lucene.document.Document; +import org.apache.lucene.document.Field.Store; import org.apache.lucene.document.FieldType; +import org.apache.lucene.document.StringField; import org.apache.lucene.document.TextField; import org.apache.lucene.index.FieldInfo.IndexOptions; import org.apache.lucene.search.DocIdSetIterator; @@ -630,4 +633,33 @@ } consumer.close(); } + + public void testDocsOnlyFreq() throws Exception { +// tests that when fields are indexed with DOCS_ONLY, the Codec +// returns 1 in docsEnum.freq() +Directory dir = newDirectory(); +Random random = random(); +IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig( +TEST_VERSION_CURRENT, new MockAnalyzer(random))); +// we don't need many documents to assert this, but don't use one document either +int numDocs = atLeast(random, 50); +for (int i = 0; i numDocs; i++) { + Document doc = new Document(); + doc.add(new StringField(f, doc, Store.NO)); + writer.addDocument(doc); +} +writer.close(); + +Term term = new Term(f, new BytesRef(doc)); +DirectoryReader reader = DirectoryReader.open(dir); +for (AtomicReaderContext ctx : reader.leaves()) { + DocsEnum de = ctx.reader().termDocsEnum(term); + while (de.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) { +assertEquals(wrong freq for doc + de.docID(), 1, de.freq()); + } +} +reader.close(); + +dir.close(); + } } BTW, I don't know if it should be like that or not, but I ran this test with -Dtests.iters=1000 and printed at the end of each iteration Codec.getDefault(). For all iterations it printed the same Codec, but if I ran the test many times (no iteration), it printed different Codecs in each run. I thought that tests.iters should simulate the behavior of running the test many times? And therefore pick a new seed + Codec (among other things) for each iteration? Shai On Tue, Dec 18, 2012 at 1:09 PM, Michael McCandless luc...@mikemccandless.com wrote: On Tue, Dec 18, 2012 at 4:46 AM, Shai Erera ser...@gmail.com wrote: Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY? Do we have a test that can trip bad Codecs? I'm not sure! We should make a test fix any failing ones ... It may be more than just changing the documentation... Right. Why would e.g. TermQuery need to write specialized code for these cases? I looked at TermScorer, and its freq() just returns docsEnum.freq(). I meant if we did not adopt this spec (freq() will lie and return 1 when the field was indexed as DOCS_ONLY), then e.g. TermQuery would need specialized code. I think that Similarity may be affected? Which brings the question - how do Similarity impls know what flags the DE was opened with, and shouldn't they be specialized? E.g. TFIDFSimilarity.ExactTFIDFDocScorer uses the freq passed to score() as an index to an array, so clearly it assumes it is = 0 and also scoreCache.length. So I
Re: DocsEnum.freq()
+1, but can you change ... if the Document was indexed with DOCS_ONLY to ... if the Field was indexed with DOCS_ONLY? Thanks Shai! Mike McCandless http://blog.mikemccandless.com On Wed, Dec 19, 2012 at 3:14 AM, Shai Erera ser...@gmail.com wrote: Here's the patch with test: Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java === --- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java (revision 1423774) +++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java (working copy) @@ -19,6 +19,7 @@ import java.io.IOException; +import org.apache.lucene.index.FieldInfo.IndexOptions; import org.apache.lucene.search.DocIdSetIterator; import org.apache.lucene.util.AttributeSource; import org.apache.lucene.util.Bits; // javadocs @@ -47,10 +48,16 @@ protected DocsEnum() { } - /** Returns term frequency in the current document. Do - * not call this before {@link #nextDoc} is first called, - * nor after {@link #nextDoc} returns NO_MORE_DOCS. - **/ + /** + * Returns term frequency in the current document, or 1 if the document was + * indexed with {@link IndexOptions#DOCS_ONLY}. Do not call this before + * {@link #nextDoc} is first called, nor after {@link #nextDoc} returns + * {@link DocIdSetIterator#NO_MORE_DOCS}. + * + * p + * bNOTE:/b if the {@link DocsEnum} was obtain with {@link #FLAG_NONE}, + * the result of this method is undefined. + */ public abstract int freq() throws IOException; /** Returns the related attributes. */ Index: lucene/core/src/test/org/apache/lucene/index/TestCodecs.java === --- lucene/core/src/test/org/apache/lucene/index/TestCodecs.java (revision 1423774) +++ lucene/core/src/test/org/apache/lucene/index/TestCodecs.java (working copy) @@ -21,6 +21,7 @@ import java.util.Arrays; import java.util.HashSet; import java.util.Iterator; +import java.util.Random; import org.apache.lucene.analysis.MockAnalyzer; import org.apache.lucene.codecs.Codec; @@ -31,7 +32,9 @@ import org.apache.lucene.codecs.TermsConsumer; import org.apache.lucene.codecs.mocksep.MockSepPostingsFormat; import org.apache.lucene.document.Document; +import org.apache.lucene.document.Field.Store; import org.apache.lucene.document.FieldType; +import org.apache.lucene.document.StringField; import org.apache.lucene.document.TextField; import org.apache.lucene.index.FieldInfo.IndexOptions; import org.apache.lucene.search.DocIdSetIterator; @@ -630,4 +633,33 @@ } consumer.close(); } + + public void testDocsOnlyFreq() throws Exception { +// tests that when fields are indexed with DOCS_ONLY, the Codec +// returns 1 in docsEnum.freq() +Directory dir = newDirectory(); +Random random = random(); +IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig( +TEST_VERSION_CURRENT, new MockAnalyzer(random))); +// we don't need many documents to assert this, but don't use one document either +int numDocs = atLeast(random, 50); +for (int i = 0; i numDocs; i++) { + Document doc = new Document(); + doc.add(new StringField(f, doc, Store.NO)); + writer.addDocument(doc); +} +writer.close(); + +Term term = new Term(f, new BytesRef(doc)); +DirectoryReader reader = DirectoryReader.open(dir); +for (AtomicReaderContext ctx : reader.leaves()) { + DocsEnum de = ctx.reader().termDocsEnum(term); + while (de.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) { +assertEquals(wrong freq for doc + de.docID(), 1, de.freq()); + } +} +reader.close(); + +dir.close(); + } } BTW, I don't know if it should be like that or not, but I ran this test with -Dtests.iters=1000 and printed at the end of each iteration Codec.getDefault(). For all iterations it printed the same Codec, but if I ran the test many times (no iteration), it printed different Codecs in each run. I thought that tests.iters should simulate the behavior of running the test many times? And therefore pick a new seed + Codec (among other things) for each iteration? Shai On Tue, Dec 18, 2012 at 1:09 PM, Michael McCandless luc...@mikemccandless.com wrote: On Tue, Dec 18, 2012 at 4:46 AM, Shai Erera ser...@gmail.com wrote: Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY? Do we have a test that can trip bad Codecs? I'm not sure! We should make a test fix any failing ones ... It may be more than just changing the documentation... Right. Why would e.g. TermQuery need to write specialized code for these cases? I looked at TermScorer, and its freq() just returns docsEnum.freq(). I meant if we did not adopt this spec (freq() will lie and return 1 when the field was indexed as DOCS_ONLY), then e.g. TermQuery would need specialized
Re: DocsEnum.freq()
Thanks for catching that Mike. Will change and commit. Shai On Wed, Dec 19, 2012 at 2:38 PM, Michael McCandless luc...@mikemccandless.com wrote: +1, but can you change ... if the Document was indexed with DOCS_ONLY to ... if the Field was indexed with DOCS_ONLY? Thanks Shai! Mike McCandless http://blog.mikemccandless.com On Wed, Dec 19, 2012 at 3:14 AM, Shai Erera ser...@gmail.com wrote: Here's the patch with test: Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java === --- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java (revision 1423774) +++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java (working copy) @@ -19,6 +19,7 @@ import java.io.IOException; +import org.apache.lucene.index.FieldInfo.IndexOptions; import org.apache.lucene.search.DocIdSetIterator; import org.apache.lucene.util.AttributeSource; import org.apache.lucene.util.Bits; // javadocs @@ -47,10 +48,16 @@ protected DocsEnum() { } - /** Returns term frequency in the current document. Do - * not call this before {@link #nextDoc} is first called, - * nor after {@link #nextDoc} returns NO_MORE_DOCS. - **/ + /** + * Returns term frequency in the current document, or 1 if the document was + * indexed with {@link IndexOptions#DOCS_ONLY}. Do not call this before + * {@link #nextDoc} is first called, nor after {@link #nextDoc} returns + * {@link DocIdSetIterator#NO_MORE_DOCS}. + * + * p + * bNOTE:/b if the {@link DocsEnum} was obtain with {@link #FLAG_NONE}, + * the result of this method is undefined. + */ public abstract int freq() throws IOException; /** Returns the related attributes. */ Index: lucene/core/src/test/org/apache/lucene/index/TestCodecs.java === --- lucene/core/src/test/org/apache/lucene/index/TestCodecs.java (revision 1423774) +++ lucene/core/src/test/org/apache/lucene/index/TestCodecs.java (working copy) @@ -21,6 +21,7 @@ import java.util.Arrays; import java.util.HashSet; import java.util.Iterator; +import java.util.Random; import org.apache.lucene.analysis.MockAnalyzer; import org.apache.lucene.codecs.Codec; @@ -31,7 +32,9 @@ import org.apache.lucene.codecs.TermsConsumer; import org.apache.lucene.codecs.mocksep.MockSepPostingsFormat; import org.apache.lucene.document.Document; +import org.apache.lucene.document.Field.Store; import org.apache.lucene.document.FieldType; +import org.apache.lucene.document.StringField; import org.apache.lucene.document.TextField; import org.apache.lucene.index.FieldInfo.IndexOptions; import org.apache.lucene.search.DocIdSetIterator; @@ -630,4 +633,33 @@ } consumer.close(); } + + public void testDocsOnlyFreq() throws Exception { +// tests that when fields are indexed with DOCS_ONLY, the Codec +// returns 1 in docsEnum.freq() +Directory dir = newDirectory(); +Random random = random(); +IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig( +TEST_VERSION_CURRENT, new MockAnalyzer(random))); +// we don't need many documents to assert this, but don't use one document either +int numDocs = atLeast(random, 50); +for (int i = 0; i numDocs; i++) { + Document doc = new Document(); + doc.add(new StringField(f, doc, Store.NO)); + writer.addDocument(doc); +} +writer.close(); + +Term term = new Term(f, new BytesRef(doc)); +DirectoryReader reader = DirectoryReader.open(dir); +for (AtomicReaderContext ctx : reader.leaves()) { + DocsEnum de = ctx.reader().termDocsEnum(term); + while (de.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) { +assertEquals(wrong freq for doc + de.docID(), 1, de.freq()); + } +} +reader.close(); + +dir.close(); + } } BTW, I don't know if it should be like that or not, but I ran this test with -Dtests.iters=1000 and printed at the end of each iteration Codec.getDefault(). For all iterations it printed the same Codec, but if I ran the test many times (no iteration), it printed different Codecs in each run. I thought that tests.iters should simulate the behavior of running the test many times? And therefore pick a new seed + Codec (among other things) for each iteration? Shai On Tue, Dec 18, 2012 at 1:09 PM, Michael McCandless luc...@mikemccandless.com wrote: On Tue, Dec 18, 2012 at 4:46 AM, Shai Erera ser...@gmail.com wrote: Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY? Do we have a test that can trip bad Codecs? I'm not sure! We should make a test fix any failing ones ... It may be more than just changing the documentation... Right. Why would
Re: DocsEnum.freq()
+1 On Wed, Dec 19, 2012 at 7:42 AM, Shai Erera ser...@gmail.com wrote: Thanks for catching that Mike. Will change and commit. Shai On Wed, Dec 19, 2012 at 2:38 PM, Michael McCandless luc...@mikemccandless.com wrote: +1, but can you change ... if the Document was indexed with DOCS_ONLY to ... if the Field was indexed with DOCS_ONLY? Thanks Shai! Mike McCandless http://blog.mikemccandless.com On Wed, Dec 19, 2012 at 3:14 AM, Shai Erera ser...@gmail.com wrote: Here's the patch with test: Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java === --- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java (revision 1423774) +++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java (working copy) @@ -19,6 +19,7 @@ import java.io.IOException; +import org.apache.lucene.index.FieldInfo.IndexOptions; import org.apache.lucene.search.DocIdSetIterator; import org.apache.lucene.util.AttributeSource; import org.apache.lucene.util.Bits; // javadocs @@ -47,10 +48,16 @@ protected DocsEnum() { } - /** Returns term frequency in the current document. Do - * not call this before {@link #nextDoc} is first called, - * nor after {@link #nextDoc} returns NO_MORE_DOCS. - **/ + /** + * Returns term frequency in the current document, or 1 if the document was + * indexed with {@link IndexOptions#DOCS_ONLY}. Do not call this before + * {@link #nextDoc} is first called, nor after {@link #nextDoc} returns + * {@link DocIdSetIterator#NO_MORE_DOCS}. + * + * p + * bNOTE:/b if the {@link DocsEnum} was obtain with {@link #FLAG_NONE}, + * the result of this method is undefined. + */ public abstract int freq() throws IOException; /** Returns the related attributes. */ Index: lucene/core/src/test/org/apache/lucene/index/TestCodecs.java === --- lucene/core/src/test/org/apache/lucene/index/TestCodecs.java (revision 1423774) +++ lucene/core/src/test/org/apache/lucene/index/TestCodecs.java (working copy) @@ -21,6 +21,7 @@ import java.util.Arrays; import java.util.HashSet; import java.util.Iterator; +import java.util.Random; import org.apache.lucene.analysis.MockAnalyzer; import org.apache.lucene.codecs.Codec; @@ -31,7 +32,9 @@ import org.apache.lucene.codecs.TermsConsumer; import org.apache.lucene.codecs.mocksep.MockSepPostingsFormat; import org.apache.lucene.document.Document; +import org.apache.lucene.document.Field.Store; import org.apache.lucene.document.FieldType; +import org.apache.lucene.document.StringField; import org.apache.lucene.document.TextField; import org.apache.lucene.index.FieldInfo.IndexOptions; import org.apache.lucene.search.DocIdSetIterator; @@ -630,4 +633,33 @@ } consumer.close(); } + + public void testDocsOnlyFreq() throws Exception { +// tests that when fields are indexed with DOCS_ONLY, the Codec +// returns 1 in docsEnum.freq() +Directory dir = newDirectory(); +Random random = random(); +IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig( +TEST_VERSION_CURRENT, new MockAnalyzer(random))); +// we don't need many documents to assert this, but don't use one document either +int numDocs = atLeast(random, 50); +for (int i = 0; i numDocs; i++) { + Document doc = new Document(); + doc.add(new StringField(f, doc, Store.NO)); + writer.addDocument(doc); +} +writer.close(); + +Term term = new Term(f, new BytesRef(doc)); +DirectoryReader reader = DirectoryReader.open(dir); +for (AtomicReaderContext ctx : reader.leaves()) { + DocsEnum de = ctx.reader().termDocsEnum(term); + while (de.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) { +assertEquals(wrong freq for doc + de.docID(), 1, de.freq()); + } +} +reader.close(); + +dir.close(); + } } BTW, I don't know if it should be like that or not, but I ran this test with -Dtests.iters=1000 and printed at the end of each iteration Codec.getDefault(). For all iterations it printed the same Codec, but if I ran the test many times (no iteration), it printed different Codecs in each run. I thought that tests.iters should simulate the behavior of running the test many times? And therefore pick a new seed + Codec (among other things) for each iteration? Shai On Tue, Dec 18, 2012 at 1:09 PM, Michael McCandless luc...@mikemccandless.com wrote: On Tue, Dec 18, 2012 at 4:46 AM, Shai Erera ser...@gmail.com wrote: Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY? Do we have a test that can trip bad Codecs? I'm not sure! We should make a test fix any failing ones
Re: DocsEnum.freq()
BTW, I don't know if it should be like that or not, but I ran this test with -Dtests.iters=1000 and printed at the end of each iteration Codec.getDefault(). For all iterations it printed the same Codec, but if I ran the test many times (no iteration), it printed different Codecs in each run. I thought that tests.iters should simulate the behavior of running the test many times? And therefore pick a new seed + Codec (among other things) for each iteration? Sorry, I missed this. What you're observing is correct -- the Codec is picked at the class level (in TestRuleSetupAndRestoreClassEnv.java). -Dtests.iters duplicates each test much like if you put two or more test methods in a suite class. Any static rules and static hooks execute *once* for a class, regardless of the number of individual tests, so the codec is picked once, depending on the master seed (not the per-method derivative seed). This would be of course solved if tests.dups could run each class with a different seed. I wish I had more time to look into this; I kind of know how to solve this I think but it'll require a significant effort put into rewriting core parts of the runner. Also, this will only work if you run it from command-line (an ant task); there is no way to duplicate an entire test suite in Eclipse or other IDEs. I'll try to look into this, be patient. Apologies for the inconvenience it causes. Dawid - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: DocsEnum.freq()
Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY? Do we have a test that can trip bad Codecs? It may be more than just changing the documentation... Why would e.g. TermQuery need to write specialized code for these cases? I looked at TermScorer, and its freq() just returns docsEnum.freq(). I think that Similarity may be affected? Which brings the question - how do Similarity impls know what flags the DE was opened with, and shouldn't they be specialized? E.g. TFIDFSimilarity.ExactTFIDFDocScorer uses the freq passed to score() as an index to an array, so clearly it assumes it is = 0 and also scoreCache.length. So I wonder what will happen to it when someone's Codec will return a negative value or MAX_INT in case frequencies aren't needed? I do realize that you shouldn't call Similarity with missing information, and TermWeight obtains a DocsEnum with frequencies, so in that regard it is safe. And if you do obtain a DocsEnum with FLAG_NONE, you'd better know what you're doing and don't pass a random freq() to Similarity. I lean towards documenting the spec from above, and ensuring that all Codecs return 1 for DOCS_ONLY. If in the future we'll need to handle the case where someone receives a DocsEnum which it needs to consume, and doesn't know which flags were used to open it, we can always add a getFlags to DE. Shai On Mon, Dec 17, 2012 at 11:30 PM, Michael McCandless luc...@mikemccandless.com wrote: On Mon, Dec 17, 2012 at 4:02 PM, Shai Erera ser...@gmail.com wrote: How do these two go together? I think for DOCS_ONLY it makes sense that we lie (say freq=1 when we don't know): lots of places would otherwise have to be special cased for when they consume DOCS_ONLY vs DOCS_AND_POSITIONS. and I'm also not sure that all codecs return 1 today if the fields was indexed with DOCS_ONLY ... That just makes it even worse right? I.e., we have code today that relies no that behavior, but we're not sure it works w/ all Codecs? Sorry, for my last sentence above I think I meant I'm also not sure that all codecs return 1 today if you ask for FLAG_NONE. Remember that DocIdSetIterator.nextDoc() was loosely specified? It was very hard to write a decent DISI consumer. Sometimes calling nextDoc() returned MAX_VAL, sometimes -1, sometimes who knows. When we hardened the spec, it actually made consumers' life easier, I think? Right, locking down the API makes total sense in general. It's ok if we say that for DOCS_ONLY you have to return 1. That's even 99.9% of the time the correct value to return (unless someone adds e.g. the same StringField twice to the document). Right. And it's also ok to say that if you passed FLAG_NONE, freq()'s value is unspecified. I think it would be wrong to lie here .. not sure if the consumer always knows how DocsEnum was requested. Not sure if this happens in real life though (consuming a DocsEnum that you didn't obtain yourself), so I'm willing to ignore that case. +1: I think FLAG_NONE should remain undefined. I think we have codecs today that will return 1, 0, the actual doc freq (when the field was indexed as = DOCS_AND_FREQS). These two together sound like a reasonable spec to me? +1 So I think just change your javadocs patch to say that FLAG_NONE means freq is not defined, and if field was indexed as DOCS_ONLY and you asked for FLAG_FREQS then we promise to lie and say freq=1? Mike McCandless http://blog.mikemccandless.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: DocsEnum.freq()
On Tue, Dec 18, 2012 at 4:46 AM, Shai Erera ser...@gmail.com wrote: Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY? Do we have a test that can trip bad Codecs? I'm not sure! We should make a test fix any failing ones ... It may be more than just changing the documentation... Right. Why would e.g. TermQuery need to write specialized code for these cases? I looked at TermScorer, and its freq() just returns docsEnum.freq(). I meant if we did not adopt this spec (freq() will lie and return 1 when the field was indexed as DOCS_ONLY), then e.g. TermQuery would need specialized code. I think that Similarity may be affected? Which brings the question - how do Similarity impls know what flags the DE was opened with, and shouldn't they be specialized? E.g. TFIDFSimilarity.ExactTFIDFDocScorer uses the freq passed to score() as an index to an array, so clearly it assumes it is = 0 and also scoreCache.length. So I wonder what will happen to it when someone's Codec will return a negative value or MAX_INT in case frequencies aren't needed? Well, if you passed FLAGS_NONE when you opened the DE then it's your responsibility to never call freq() ... ie, don't call freq() and pass that to the sim. I do realize that you shouldn't call Similarity with missing information, and TermWeight obtains a DocsEnum with frequencies, so in that regard it is safe. And if you do obtain a DocsEnum with FLAG_NONE, you'd better know what you're doing and don't pass a random freq() to Similarity. Right. I lean towards documenting the spec from above, and ensuring that all Codecs return 1 for DOCS_ONLY. +1 So freq() is undefined if you had passed FLAGS_NONE, and we will lie and say freq=1 (need a test verifying this) if the field was indexed as DOCS_ONLY. If in the future we'll need to handle the case where someone receives a DocsEnum which it needs to consume, and doesn't know which flags were used to open it, we can always add a getFlags to DE. Yeah ... Mike McCandless http://blog.mikemccandless.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
DocsEnum.freq()
Hi While migrating code to Lucene 4.0, I noticed that I have an assert on a field that is indexed with DOCS_ONLY that DocsEnum.freq() == 1. This got me thinking ... why? If you index w/ DOCS_ONLY, or ask for DocsEnum with FLAG_NONE, why do we lie to the consumer? Rather, we could just return 0 or -1? I personally don't mind if we continue to return 1, if there's a real reason to. I don't think that anyone should call freq() if he asked for DocsEnum with FLAG_NONE. But if we do keep the current behavior, can we at least document it? E.g., something like this patch: Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java === --- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java (revision 1422804) +++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java (working copy) @@ -47,10 +47,16 @@ protected DocsEnum() { } - /** Returns term frequency in the current document. Do - * not call this before {@link #nextDoc} is first called, - * nor after {@link #nextDoc} returns NO_MORE_DOCS. - **/ + /** + * Returns term frequency in the current document, or 1 if the + * {@link DocsEnum} was obtained with {@link #FLAG_NONE}. Do not call this + * before {@link #nextDoc} is first called, nor after {@link #nextDoc} returns + * {@link DocIdSetIterator#NO_MORE_DOCS}. + * + * p + * bNOTE:/b if the {@link DocsEnum} was obtain with {@link #FLAG_NONE}, + * this method returns 1. + */ public abstract int freq() throws IOException; /** Returns the related attributes. */ Shai
Re: DocsEnum.freq()
On Mon, Dec 17, 2012 at 11:24 AM, Shai Erera ser...@gmail.com wrote: Hi While migrating code to Lucene 4.0, I noticed that I have an assert on a field that is indexed with DOCS_ONLY that DocsEnum.freq() == 1. This got me thinking ... why? 1 is pretty intuitive i think. i think of omitting freqs for things like super-short fields where the frequency is expected to be 1 anyway. I see your point about lying, but if we change this, we have to change all code using it like TermQuery to not call freqs if they are unavailable: i think this makes the apis pretty difficult to use. But if we do keep the current behavior, can we at least document it? E.g., something like this patch: +1 to the patch. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: DocsEnum.freq()
I think the FLAG_NONE (I don't need/want freqs when reading the index) and the DOCS_ONLY (Do not index freqs) are two different cases? I think for DOCS_ONLY it makes sense that we lie (say freq=1 when we don't know): lots of places would otherwise have to be special cased for when they consume DOCS_ONLY vs DOCS_AND_POSITIONS. But, for FLAG_NONE, when the caller passes this it means they have no intention of using/calling freq() right? Eg MultiTermQueryWrapperFilter would pass this. For that case I'm not sure we should promise / require that codecs return 1 always? EG what if the index does has freqs? I think in that case the codec shouldn't be required to go out of its way and return 1? I'm also not sure that all codecs return 1 today if the fields was indexed with DOCS_ONLY ... Mike McCandless http://blog.mikemccandless.com On Mon, Dec 17, 2012 at 11:24 AM, Shai Erera ser...@gmail.com wrote: Hi While migrating code to Lucene 4.0, I noticed that I have an assert on a field that is indexed with DOCS_ONLY that DocsEnum.freq() == 1. This got me thinking ... why? If you index w/ DOCS_ONLY, or ask for DocsEnum with FLAG_NONE, why do we lie to the consumer? Rather, we could just return 0 or -1? I personally don't mind if we continue to return 1, if there's a real reason to. I don't think that anyone should call freq() if he asked for DocsEnum with FLAG_NONE. But if we do keep the current behavior, can we at least document it? E.g., something like this patch: Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java === --- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java (revision 1422804) +++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java (working copy) @@ -47,10 +47,16 @@ protected DocsEnum() { } - /** Returns term frequency in the current document. Do - * not call this before {@link #nextDoc} is first called, - * nor after {@link #nextDoc} returns NO_MORE_DOCS. - **/ + /** + * Returns term frequency in the current document, or 1 if the + * {@link DocsEnum} was obtained with {@link #FLAG_NONE}. Do not call this + * before {@link #nextDoc} is first called, nor after {@link #nextDoc} returns + * {@link DocIdSetIterator#NO_MORE_DOCS}. + * + * p + * bNOTE:/b if the {@link DocsEnum} was obtain with {@link #FLAG_NONE}, + * this method returns 1. + */ public abstract int freq() throws IOException; /** Returns the related attributes. */ Shai - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: DocsEnum.freq()
How do these two go together? I think for DOCS_ONLY it makes sense that we lie (say freq=1 when we don't know): lots of places would otherwise have to be special cased for when they consume DOCS_ONLY vs DOCS_AND_POSITIONS. and I'm also not sure that all codecs return 1 today if the fields was indexed with DOCS_ONLY ... That just makes it even worse right? I.e., we have code today that relies no that behavior, but we're not sure it works w/ all Codecs? Remember that DocIdSetIterator.nextDoc() was loosely specified? It was very hard to write a decent DISI consumer. Sometimes calling nextDoc() returned MAX_VAL, sometimes -1, sometimes who knows. When we hardened the spec, it actually made consumers' life easier, I think? It's ok if we say that for DOCS_ONLY you have to return 1. That's even 99.9% of the time the correct value to return (unless someone adds e.g. the same StringField twice to the document). And it's also ok to say that if you passed FLAG_NONE, freq()'s value is unspecified. I think it would be wrong to lie here .. not sure if the consumer always knows how DocsEnum was requested. Not sure if this happens in real life though (consuming a DocsEnum that you didn't obtain yourself), so I'm willing to ignore that case. These two together sound like a reasonable spec to me? Shai On Mon, Dec 17, 2012 at 7:16 PM, Michael McCandless luc...@mikemccandless.com wrote: I think the FLAG_NONE (I don't need/want freqs when reading the index) and the DOCS_ONLY (Do not index freqs) are two different cases? I think for DOCS_ONLY it makes sense that we lie (say freq=1 when we don't know): lots of places would otherwise have to be special cased for when they consume DOCS_ONLY vs DOCS_AND_POSITIONS. But, for FLAG_NONE, when the caller passes this it means they have no intention of using/calling freq() right? Eg MultiTermQueryWrapperFilter would pass this. For that case I'm not sure we should promise / require that codecs return 1 always? EG what if the index does has freqs? I think in that case the codec shouldn't be required to go out of its way and return 1? I'm also not sure that all codecs return 1 today if the fields was indexed with DOCS_ONLY ... Mike McCandless http://blog.mikemccandless.com On Mon, Dec 17, 2012 at 11:24 AM, Shai Erera ser...@gmail.com wrote: Hi While migrating code to Lucene 4.0, I noticed that I have an assert on a field that is indexed with DOCS_ONLY that DocsEnum.freq() == 1. This got me thinking ... why? If you index w/ DOCS_ONLY, or ask for DocsEnum with FLAG_NONE, why do we lie to the consumer? Rather, we could just return 0 or -1? I personally don't mind if we continue to return 1, if there's a real reason to. I don't think that anyone should call freq() if he asked for DocsEnum with FLAG_NONE. But if we do keep the current behavior, can we at least document it? E.g., something like this patch: Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java === --- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java (revision 1422804) +++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java (working copy) @@ -47,10 +47,16 @@ protected DocsEnum() { } - /** Returns term frequency in the current document. Do - * not call this before {@link #nextDoc} is first called, - * nor after {@link #nextDoc} returns NO_MORE_DOCS. - **/ + /** + * Returns term frequency in the current document, or 1 if the + * {@link DocsEnum} was obtained with {@link #FLAG_NONE}. Do not call this + * before {@link #nextDoc} is first called, nor after {@link #nextDoc} returns + * {@link DocIdSetIterator#NO_MORE_DOCS}. + * + * p + * bNOTE:/b if the {@link DocsEnum} was obtain with {@link #FLAG_NONE}, + * this method returns 1. + */ public abstract int freq() throws IOException; /** Returns the related attributes. */ Shai - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: DocsEnum.freq()
On Mon, Dec 17, 2012 at 4:02 PM, Shai Erera ser...@gmail.com wrote: How do these two go together? I think for DOCS_ONLY it makes sense that we lie (say freq=1 when we don't know): lots of places would otherwise have to be special cased for when they consume DOCS_ONLY vs DOCS_AND_POSITIONS. and I'm also not sure that all codecs return 1 today if the fields was indexed with DOCS_ONLY ... That just makes it even worse right? I.e., we have code today that relies no that behavior, but we're not sure it works w/ all Codecs? Sorry, for my last sentence above I think I meant I'm also not sure that all codecs return 1 today if you ask for FLAG_NONE. Remember that DocIdSetIterator.nextDoc() was loosely specified? It was very hard to write a decent DISI consumer. Sometimes calling nextDoc() returned MAX_VAL, sometimes -1, sometimes who knows. When we hardened the spec, it actually made consumers' life easier, I think? Right, locking down the API makes total sense in general. It's ok if we say that for DOCS_ONLY you have to return 1. That's even 99.9% of the time the correct value to return (unless someone adds e.g. the same StringField twice to the document). Right. And it's also ok to say that if you passed FLAG_NONE, freq()'s value is unspecified. I think it would be wrong to lie here .. not sure if the consumer always knows how DocsEnum was requested. Not sure if this happens in real life though (consuming a DocsEnum that you didn't obtain yourself), so I'm willing to ignore that case. +1: I think FLAG_NONE should remain undefined. I think we have codecs today that will return 1, 0, the actual doc freq (when the field was indexed as = DOCS_AND_FREQS). These two together sound like a reasonable spec to me? +1 So I think just change your javadocs patch to say that FLAG_NONE means freq is not defined, and if field was indexed as DOCS_ONLY and you asked for FLAG_FREQS then we promise to lie and say freq=1? Mike McCandless http://blog.mikemccandless.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org