[jira] [Created] (LUCENE-5315) Some subclasses of Scorer do not honor the contract of DocsEnum.freq()

2013-10-29 Thread Kai Chan (JIRA)
Kai Chan created LUCENE-5315:


 Summary: Some subclasses of Scorer do not honor the contract of 
DocsEnum.freq()
 Key: LUCENE-5315
 URL: https://issues.apache.org/jira/browse/LUCENE-5315
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/query/scoring
Affects Versions: 4.5.1
Reporter: Kai Chan


The behavior of Scorer.freq() is inconsistent across its subclasses:
* For TermScorer, the freq() method behaves just as DocsEnum.freq() specifies, 
i.e. the method returns the term frequency in the current document.
* For BooleanScorer2, ConjunctionScorer, DisjunctionMaxScorer, 
DisjunctionSumScorer, and possibly other classes, the freq() method returns the 
number of clauses (in BooleanQuery or DisjunctionMaxQuery) that match the 
current document.

This difference makes the meaning of Scorer.freq()'s return value uncertain. To 
add to the uncertainty, given a Query, there seems to be no way of knowing 
which behavior takes effect (as that is not specified in the API reference) 
except by reading or running the code.

This issue might be related to LUCENE-4514.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5315) Some subclasses of Scorer do not honor the contract of DocsEnum.freq()

2013-10-29 Thread Kai Chan (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Chan updated LUCENE-5315:
-

Attachment: DisjunctionMaxQueryTest.java

This is a very simple program that illustrates the problem. I am attaching the 
output in a separate file.

 Some subclasses of Scorer do not honor the contract of DocsEnum.freq()
 --

 Key: LUCENE-5315
 URL: https://issues.apache.org/jira/browse/LUCENE-5315
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/query/scoring
Affects Versions: 4.5.1
Reporter: Kai Chan
 Attachments: DisjunctionMaxQueryTest.java


 The behavior of Scorer.freq() is inconsistent across its subclasses:
 * For TermScorer, the freq() method behaves just as DocsEnum.freq() 
 specifies, i.e. the method returns the term frequency in the current 
 document.
 * For BooleanScorer2, ConjunctionScorer, DisjunctionMaxScorer, 
 DisjunctionSumScorer, and possibly other classes, the freq() method returns 
 the number of clauses (in BooleanQuery or DisjunctionMaxQuery) that match the 
 current document.
 This difference makes the meaning of Scorer.freq()'s return value uncertain. 
 To add to the uncertainty, given a Query, there seems to be no way of knowing 
 which behavior takes effect (as that is not specified in the API reference) 
 except by reading or running the code.
 This issue might be related to LUCENE-4514.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5315) Some subclasses of Scorer do not honor the contract of DocsEnum.freq()

2013-10-29 Thread Kai Chan (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Chan updated LUCENE-5315:
-

Attachment: DisjunctionMaxQueryTest.java.output

Here is the output of DisjunctionMaxQueryTest. For the freq() method, the 
return values (3, 4, 5) from TermScorer are term frequencies, while the return 
values (1, 2) from ConjunctionScorer and others are clause counts.

 Some subclasses of Scorer do not honor the contract of DocsEnum.freq()
 --

 Key: LUCENE-5315
 URL: https://issues.apache.org/jira/browse/LUCENE-5315
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/query/scoring
Affects Versions: 4.5.1
Reporter: Kai Chan
 Attachments: DisjunctionMaxQueryTest.java, 
 DisjunctionMaxQueryTest.java.output


 The behavior of Scorer.freq() is inconsistent across its subclasses:
 * For TermScorer, the freq() method behaves just as DocsEnum.freq() 
 specifies, i.e. the method returns the term frequency in the current 
 document.
 * For BooleanScorer2, ConjunctionScorer, DisjunctionMaxScorer, 
 DisjunctionSumScorer, and possibly other classes, the freq() method returns 
 the number of clauses (in BooleanQuery or DisjunctionMaxQuery) that match the 
 current document.
 This difference makes the meaning of Scorer.freq()'s return value uncertain. 
 To add to the uncertainty, given a Query, there seems to be no way of knowing 
 which behavior takes effect (as that is not specified in the API reference) 
 except by reading or running the code.
 This issue might be related to LUCENE-4514.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5315) Some subclasses of Scorer do not honor the contract of DocsEnum.freq()

2013-10-29 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808586#comment-13808586
 ] 

Robert Muir commented on LUCENE-5315:
-

I dont see the bug, as it said in  LUCENE-4514, its the number of matches in 
the document for that scorer.

So if its a Termscorer its the number of occurrences of that term. if its a 
phrasescorer its the number of occurrences of that phrase. If its a 
booleanscorer, its the number of occurrences of that boolean conjunct.

It does not try to 'sum' or anything. if you want to that, you can easily do it 
since you can navigate subscorers with getChildren.

 Some subclasses of Scorer do not honor the contract of DocsEnum.freq()
 --

 Key: LUCENE-5315
 URL: https://issues.apache.org/jira/browse/LUCENE-5315
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/query/scoring
Affects Versions: 4.5.1
Reporter: Kai Chan
 Attachments: DisjunctionMaxQueryTest.java, 
 DisjunctionMaxQueryTest.java.output


 The behavior of Scorer.freq() is inconsistent across its subclasses:
 * For TermScorer, the freq() method behaves just as DocsEnum.freq() 
 specifies, i.e. the method returns the term frequency in the current 
 document.
 * For BooleanScorer2, ConjunctionScorer, DisjunctionMaxScorer, 
 DisjunctionSumScorer, and possibly other classes, the freq() method returns 
 the number of clauses (in BooleanQuery or DisjunctionMaxQuery) that match the 
 current document.
 This difference makes the meaning of Scorer.freq()'s return value uncertain. 
 To add to the uncertainty, given a Query, there seems to be no way of knowing 
 which behavior takes effect (as that is not specified in the API reference) 
 except by reading or running the code.
 This issue might be related to LUCENE-4514.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5315) Some subclasses of Scorer do not honor the contract of DocsEnum.freq()

2013-10-29 Thread Kai Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808607#comment-13808607
 ] 

Kai Chan commented on LUCENE-5315:
--

This is how I see it. I could be wrong though, and please feel free to point 
out if it is the case.

# The API reference of DocsEnum.freq() states that its return value is term 
frequency in the current document, or 1 if the field was indexed with 
FieldInfo.IndexOptions.DOCS_ONLY. (I do not think the second part applies here 
though, just the first part.)
# Scorer and its subclasses extend DocEnum, and therefore their freq() methods 
override DocsEnum.freq().
# Because of (1) and (2), I thought the freq() method of Scorer and its 
subclasses should also return term frequency in the current document, to 
honor the contract set by DocsEnum.freq().
# However, for some of Scorer's subclasses, (3) is not how the classes are 
currently implemented.

 Some subclasses of Scorer do not honor the contract of DocsEnum.freq()
 --

 Key: LUCENE-5315
 URL: https://issues.apache.org/jira/browse/LUCENE-5315
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/query/scoring
Affects Versions: 4.5.1
Reporter: Kai Chan
 Attachments: DisjunctionMaxQueryTest.java, 
 DisjunctionMaxQueryTest.java.output


 The behavior of Scorer.freq() is inconsistent across its subclasses:
 * For TermScorer, the freq() method behaves just as DocsEnum.freq() 
 specifies, i.e. the method returns the term frequency in the current 
 document.
 * For BooleanScorer2, ConjunctionScorer, DisjunctionMaxScorer, 
 DisjunctionSumScorer, and possibly other classes, the freq() method returns 
 the number of clauses (in BooleanQuery or DisjunctionMaxQuery) that match the 
 current document.
 This difference makes the meaning of Scorer.freq()'s return value uncertain. 
 To add to the uncertainty, given a Query, there seems to be no way of knowing 
 which behavior takes effect (as that is not specified in the API reference) 
 except by reading or running the code.
 This issue might be related to LUCENE-4514.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5315) Some subclasses of Scorer do not honor the contract of DocsEnum.freq()

2013-10-29 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808615#comment-13808615
 ] 

Robert Muir commented on LUCENE-5315:
-

i think DocsEnum.freq() should be changed to say matches.

According to your strict definition, PhraseScorer is broken too, because it 
doesnt return term frequencies :)

 Some subclasses of Scorer do not honor the contract of DocsEnum.freq()
 --

 Key: LUCENE-5315
 URL: https://issues.apache.org/jira/browse/LUCENE-5315
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/query/scoring
Affects Versions: 4.5.1
Reporter: Kai Chan
 Attachments: DisjunctionMaxQueryTest.java, 
 DisjunctionMaxQueryTest.java.output


 The behavior of Scorer.freq() is inconsistent across its subclasses:
 * For TermScorer, the freq() method behaves just as DocsEnum.freq() 
 specifies, i.e. the method returns the term frequency in the current 
 document.
 * For BooleanScorer2, ConjunctionScorer, DisjunctionMaxScorer, 
 DisjunctionSumScorer, and possibly other classes, the freq() method returns 
 the number of clauses (in BooleanQuery or DisjunctionMaxQuery) that match the 
 current document.
 This difference makes the meaning of Scorer.freq()'s return value uncertain. 
 To add to the uncertainty, given a Query, there seems to be no way of knowing 
 which behavior takes effect (as that is not specified in the API reference) 
 except by reading or running the code.
 This issue might be related to LUCENE-4514.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: DocsEnum.freq()

2012-12-19 Thread Shai Erera
Here's the patch with test:

Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java
===
--- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (revision
1423774)
+++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (working
copy)
@@ -19,6 +19,7 @@

 import java.io.IOException;

+import org.apache.lucene.index.FieldInfo.IndexOptions;
 import org.apache.lucene.search.DocIdSetIterator;
 import org.apache.lucene.util.AttributeSource;
 import org.apache.lucene.util.Bits; // javadocs
@@ -47,10 +48,16 @@
   protected DocsEnum() {
   }

-  /** Returns term frequency in the current document.  Do
-   *  not call this before {@link #nextDoc} is first called,
-   *  nor after {@link #nextDoc} returns NO_MORE_DOCS.
-   **/
+  /**
+   * Returns term frequency in the current document, or 1 if the document
was
+   * indexed with {@link IndexOptions#DOCS_ONLY}. Do not call this before
+   * {@link #nextDoc} is first called, nor after {@link #nextDoc} returns
+   * {@link DocIdSetIterator#NO_MORE_DOCS}.
+   *
+   * p
+   * bNOTE:/b if the {@link DocsEnum} was obtain with {@link
#FLAG_NONE},
+   * the result of this method is undefined.
+   */
   public abstract int freq() throws IOException;

   /** Returns the related attributes. */
Index: lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
===
--- lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
(revision 1423774)
+++ lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
(working copy)
@@ -21,6 +21,7 @@
 import java.util.Arrays;
 import java.util.HashSet;
 import java.util.Iterator;
+import java.util.Random;

 import org.apache.lucene.analysis.MockAnalyzer;
 import org.apache.lucene.codecs.Codec;
@@ -31,7 +32,9 @@
 import org.apache.lucene.codecs.TermsConsumer;
 import org.apache.lucene.codecs.mocksep.MockSepPostingsFormat;
 import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field.Store;
 import org.apache.lucene.document.FieldType;
+import org.apache.lucene.document.StringField;
 import org.apache.lucene.document.TextField;
 import org.apache.lucene.index.FieldInfo.IndexOptions;
 import org.apache.lucene.search.DocIdSetIterator;
@@ -630,4 +633,33 @@
 }
 consumer.close();
   }
+
+  public void testDocsOnlyFreq() throws Exception {
+// tests that when fields are indexed with DOCS_ONLY, the Codec
+// returns 1 in docsEnum.freq()
+Directory dir = newDirectory();
+Random random = random();
+IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig(
+TEST_VERSION_CURRENT, new MockAnalyzer(random)));
+// we don't need many documents to assert this, but don't use one
document either
+int numDocs = atLeast(random, 50);
+for (int i = 0; i  numDocs; i++) {
+  Document doc = new Document();
+  doc.add(new StringField(f, doc, Store.NO));
+  writer.addDocument(doc);
+}
+writer.close();
+
+Term term = new Term(f, new BytesRef(doc));
+DirectoryReader reader = DirectoryReader.open(dir);
+for (AtomicReaderContext ctx : reader.leaves()) {
+  DocsEnum de = ctx.reader().termDocsEnum(term);
+  while (de.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
+assertEquals(wrong freq for doc  + de.docID(), 1, de.freq());
+  }
+}
+reader.close();
+
+dir.close();
+  }
 }

BTW, I don't know if it should be like that or not, but I ran this test
with -Dtests.iters=1000 and printed at the end of each iteration
Codec.getDefault().
For all iterations it printed the same Codec, but if I ran the test many
times (no iteration), it printed different Codecs in each run.
I thought that tests.iters should simulate the behavior of running the test
many times? And therefore pick a new seed + Codec (among other things) for
each iteration?

Shai


On Tue, Dec 18, 2012 at 1:09 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 On Tue, Dec 18, 2012 at 4:46 AM, Shai Erera ser...@gmail.com wrote:
  Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY? Do
 we
  have a test that can trip bad Codecs?

 I'm not sure!  We should make a test  fix any failing ones ...

  It may be more than just changing the documentation...

 Right.

  Why would e.g. TermQuery need to write specialized code for these cases?
 I
  looked at TermScorer, and its freq() just returns docsEnum.freq().

 I meant if we did not adopt this spec (freq() will lie and return 1
 when the field was indexed as DOCS_ONLY), then e.g. TermQuery would
 need specialized code.

  I think that Similarity may be affected? Which brings the question - how
 do
  Similarity impls know what flags the DE was opened with, and shouldn't
 they
  be specialized?
  E.g. TFIDFSimilarity.ExactTFIDFDocScorer uses the freq passed to score()
 as
  an index to an array, so clearly it assumes it is = 0 and also 
  scoreCache.length.
  So I

Re: DocsEnum.freq()

2012-12-19 Thread Michael McCandless
+1, but can you change ... if the Document was indexed with
DOCS_ONLY to ... if the Field was indexed with DOCS_ONLY?

Thanks Shai!

Mike McCandless

http://blog.mikemccandless.com

On Wed, Dec 19, 2012 at 3:14 AM, Shai Erera ser...@gmail.com wrote:
 Here's the patch with test:

 Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java
 ===
 --- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (revision
 1423774)
 +++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (working
 copy)
 @@ -19,6 +19,7 @@

  import java.io.IOException;

 +import org.apache.lucene.index.FieldInfo.IndexOptions;
  import org.apache.lucene.search.DocIdSetIterator;
  import org.apache.lucene.util.AttributeSource;
  import org.apache.lucene.util.Bits; // javadocs
 @@ -47,10 +48,16 @@

protected DocsEnum() {
}

 -  /** Returns term frequency in the current document.  Do
 -   *  not call this before {@link #nextDoc} is first called,
 -   *  nor after {@link #nextDoc} returns NO_MORE_DOCS.
 -   **/
 +  /**
 +   * Returns term frequency in the current document, or 1 if the document
 was
 +   * indexed with {@link IndexOptions#DOCS_ONLY}. Do not call this before
 +   * {@link #nextDoc} is first called, nor after {@link #nextDoc} returns

 +   * {@link DocIdSetIterator#NO_MORE_DOCS}.
 +   *
 +   * p
 +   * bNOTE:/b if the {@link DocsEnum} was obtain with {@link
 #FLAG_NONE},
 +   * the result of this method is undefined.

 +   */
public abstract int freq() throws IOException;

/** Returns the related attributes. */
 Index: lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
 ===
 --- lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
 (revision 1423774)
 +++ lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
 (working copy)
 @@ -21,6 +21,7 @@
  import java.util.Arrays;
  import java.util.HashSet;
  import java.util.Iterator;
 +import java.util.Random;

  import org.apache.lucene.analysis.MockAnalyzer;
  import org.apache.lucene.codecs.Codec;
 @@ -31,7 +32,9 @@
  import org.apache.lucene.codecs.TermsConsumer;
  import org.apache.lucene.codecs.mocksep.MockSepPostingsFormat;
  import org.apache.lucene.document.Document;
 +import org.apache.lucene.document.Field.Store;
  import org.apache.lucene.document.FieldType;
 +import org.apache.lucene.document.StringField;
  import org.apache.lucene.document.TextField;
  import org.apache.lucene.index.FieldInfo.IndexOptions;
  import org.apache.lucene.search.DocIdSetIterator;
 @@ -630,4 +633,33 @@
  }
  consumer.close();
}
 +
 +  public void testDocsOnlyFreq() throws Exception {
 +// tests that when fields are indexed with DOCS_ONLY, the Codec
 +// returns 1 in docsEnum.freq()
 +Directory dir = newDirectory();
 +Random random = random();
 +IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig(
 +TEST_VERSION_CURRENT, new MockAnalyzer(random)));
 +// we don't need many documents to assert this, but don't use one
 document either
 +int numDocs = atLeast(random, 50);
 +for (int i = 0; i  numDocs; i++) {
 +  Document doc = new Document();
 +  doc.add(new StringField(f, doc, Store.NO));
 +  writer.addDocument(doc);
 +}
 +writer.close();
 +
 +Term term = new Term(f, new BytesRef(doc));
 +DirectoryReader reader = DirectoryReader.open(dir);
 +for (AtomicReaderContext ctx : reader.leaves()) {
 +  DocsEnum de = ctx.reader().termDocsEnum(term);
 +  while (de.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
 +assertEquals(wrong freq for doc  + de.docID(), 1, de.freq());
 +  }
 +}
 +reader.close();
 +
 +dir.close();
 +  }
  }

 BTW, I don't know if it should be like that or not, but I ran this test with
 -Dtests.iters=1000 and printed at the end of each iteration
 Codec.getDefault().
 For all iterations it printed the same Codec, but if I ran the test many
 times (no iteration), it printed different Codecs in each run.
 I thought that tests.iters should simulate the behavior of running the test
 many times? And therefore pick a new seed + Codec (among other things) for
 each iteration?

 Shai


 On Tue, Dec 18, 2012 at 1:09 PM, Michael McCandless
 luc...@mikemccandless.com wrote:

 On Tue, Dec 18, 2012 at 4:46 AM, Shai Erera ser...@gmail.com wrote:
  Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY? Do
  we
  have a test that can trip bad Codecs?

 I'm not sure!  We should make a test  fix any failing ones ...

  It may be more than just changing the documentation...

 Right.

  Why would e.g. TermQuery need to write specialized code for these cases?
  I
  looked at TermScorer, and its freq() just returns docsEnum.freq().

 I meant if we did not adopt this spec (freq() will lie and return 1
 when the field was indexed as DOCS_ONLY), then e.g. TermQuery would
 need specialized

Re: DocsEnum.freq()

2012-12-19 Thread Shai Erera
Thanks for catching that Mike. Will change and commit.

Shai


On Wed, Dec 19, 2012 at 2:38 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 +1, but can you change ... if the Document was indexed with
 DOCS_ONLY to ... if the Field was indexed with DOCS_ONLY?

 Thanks Shai!

 Mike McCandless

 http://blog.mikemccandless.com

 On Wed, Dec 19, 2012 at 3:14 AM, Shai Erera ser...@gmail.com wrote:
  Here's the patch with test:
 
  Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java
  ===
  --- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (revision
  1423774)
  +++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (working
  copy)
  @@ -19,6 +19,7 @@
 
   import java.io.IOException;
 
  +import org.apache.lucene.index.FieldInfo.IndexOptions;
   import org.apache.lucene.search.DocIdSetIterator;
   import org.apache.lucene.util.AttributeSource;
   import org.apache.lucene.util.Bits; // javadocs
  @@ -47,10 +48,16 @@
 
 protected DocsEnum() {
 }
 
  -  /** Returns term frequency in the current document.  Do
  -   *  not call this before {@link #nextDoc} is first called,
  -   *  nor after {@link #nextDoc} returns NO_MORE_DOCS.
  -   **/
  +  /**
  +   * Returns term frequency in the current document, or 1 if the
 document
  was
  +   * indexed with {@link IndexOptions#DOCS_ONLY}. Do not call this
 before
  +   * {@link #nextDoc} is first called, nor after {@link #nextDoc}
 returns
 
  +   * {@link DocIdSetIterator#NO_MORE_DOCS}.
  +   *
  +   * p
  +   * bNOTE:/b if the {@link DocsEnum} was obtain with {@link
  #FLAG_NONE},
  +   * the result of this method is undefined.
 
  +   */
 public abstract int freq() throws IOException;
 
 /** Returns the related attributes. */
  Index: lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
  ===
  --- lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
  (revision 1423774)
  +++ lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
  (working copy)
  @@ -21,6 +21,7 @@
   import java.util.Arrays;
   import java.util.HashSet;
   import java.util.Iterator;
  +import java.util.Random;
 
   import org.apache.lucene.analysis.MockAnalyzer;
   import org.apache.lucene.codecs.Codec;
  @@ -31,7 +32,9 @@
   import org.apache.lucene.codecs.TermsConsumer;
   import org.apache.lucene.codecs.mocksep.MockSepPostingsFormat;
   import org.apache.lucene.document.Document;
  +import org.apache.lucene.document.Field.Store;
   import org.apache.lucene.document.FieldType;
  +import org.apache.lucene.document.StringField;
   import org.apache.lucene.document.TextField;
   import org.apache.lucene.index.FieldInfo.IndexOptions;
   import org.apache.lucene.search.DocIdSetIterator;
  @@ -630,4 +633,33 @@
   }
   consumer.close();
 }
  +
  +  public void testDocsOnlyFreq() throws Exception {
  +// tests that when fields are indexed with DOCS_ONLY, the Codec
  +// returns 1 in docsEnum.freq()
  +Directory dir = newDirectory();
  +Random random = random();
  +IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig(
  +TEST_VERSION_CURRENT, new MockAnalyzer(random)));
  +// we don't need many documents to assert this, but don't use one
  document either
  +int numDocs = atLeast(random, 50);
  +for (int i = 0; i  numDocs; i++) {
  +  Document doc = new Document();
  +  doc.add(new StringField(f, doc, Store.NO));
  +  writer.addDocument(doc);
  +}
  +writer.close();
  +
  +Term term = new Term(f, new BytesRef(doc));
  +DirectoryReader reader = DirectoryReader.open(dir);
  +for (AtomicReaderContext ctx : reader.leaves()) {
  +  DocsEnum de = ctx.reader().termDocsEnum(term);
  +  while (de.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
  +assertEquals(wrong freq for doc  + de.docID(), 1, de.freq());
  +  }
  +}
  +reader.close();
  +
  +dir.close();
  +  }
   }
 
  BTW, I don't know if it should be like that or not, but I ran this test
 with
  -Dtests.iters=1000 and printed at the end of each iteration
  Codec.getDefault().
  For all iterations it printed the same Codec, but if I ran the test many
  times (no iteration), it printed different Codecs in each run.
  I thought that tests.iters should simulate the behavior of running the
 test
  many times? And therefore pick a new seed + Codec (among other things)
 for
  each iteration?
 
  Shai
 
 
  On Tue, Dec 18, 2012 at 1:09 PM, Michael McCandless
  luc...@mikemccandless.com wrote:
 
  On Tue, Dec 18, 2012 at 4:46 AM, Shai Erera ser...@gmail.com wrote:
   Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY?
 Do
   we
   have a test that can trip bad Codecs?
 
  I'm not sure!  We should make a test  fix any failing ones ...
 
   It may be more than just changing the documentation...
 
  Right.
 
   Why would

Re: DocsEnum.freq()

2012-12-19 Thread Robert Muir
+1

On Wed, Dec 19, 2012 at 7:42 AM, Shai Erera ser...@gmail.com wrote:
 Thanks for catching that Mike. Will change and commit.

 Shai


 On Wed, Dec 19, 2012 at 2:38 PM, Michael McCandless
 luc...@mikemccandless.com wrote:

 +1, but can you change ... if the Document was indexed with
 DOCS_ONLY to ... if the Field was indexed with DOCS_ONLY?

 Thanks Shai!

 Mike McCandless

 http://blog.mikemccandless.com

 On Wed, Dec 19, 2012 at 3:14 AM, Shai Erera ser...@gmail.com wrote:
  Here's the patch with test:
 
  Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java
  ===
  --- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java
  (revision
  1423774)
  +++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (working
  copy)
  @@ -19,6 +19,7 @@
 
   import java.io.IOException;
 
  +import org.apache.lucene.index.FieldInfo.IndexOptions;
   import org.apache.lucene.search.DocIdSetIterator;
   import org.apache.lucene.util.AttributeSource;
   import org.apache.lucene.util.Bits; // javadocs
  @@ -47,10 +48,16 @@
 
 protected DocsEnum() {
 }
 
  -  /** Returns term frequency in the current document.  Do
  -   *  not call this before {@link #nextDoc} is first called,
  -   *  nor after {@link #nextDoc} returns NO_MORE_DOCS.
  -   **/
  +  /**
  +   * Returns term frequency in the current document, or 1 if the
  document
  was
  +   * indexed with {@link IndexOptions#DOCS_ONLY}. Do not call this
  before
  +   * {@link #nextDoc} is first called, nor after {@link #nextDoc}
  returns
 
  +   * {@link DocIdSetIterator#NO_MORE_DOCS}.
  +   *
  +   * p
  +   * bNOTE:/b if the {@link DocsEnum} was obtain with {@link
  #FLAG_NONE},
  +   * the result of this method is undefined.
 
  +   */
 public abstract int freq() throws IOException;
 
 /** Returns the related attributes. */
  Index: lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
  ===
  --- lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
  (revision 1423774)
  +++ lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
  (working copy)
  @@ -21,6 +21,7 @@
   import java.util.Arrays;
   import java.util.HashSet;
   import java.util.Iterator;
  +import java.util.Random;
 
   import org.apache.lucene.analysis.MockAnalyzer;
   import org.apache.lucene.codecs.Codec;
  @@ -31,7 +32,9 @@
   import org.apache.lucene.codecs.TermsConsumer;
   import org.apache.lucene.codecs.mocksep.MockSepPostingsFormat;
   import org.apache.lucene.document.Document;
  +import org.apache.lucene.document.Field.Store;
   import org.apache.lucene.document.FieldType;
  +import org.apache.lucene.document.StringField;
   import org.apache.lucene.document.TextField;
   import org.apache.lucene.index.FieldInfo.IndexOptions;
   import org.apache.lucene.search.DocIdSetIterator;
  @@ -630,4 +633,33 @@
   }
   consumer.close();
 }
  +
  +  public void testDocsOnlyFreq() throws Exception {
  +// tests that when fields are indexed with DOCS_ONLY, the Codec
  +// returns 1 in docsEnum.freq()
  +Directory dir = newDirectory();
  +Random random = random();
  +IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig(
  +TEST_VERSION_CURRENT, new MockAnalyzer(random)));
  +// we don't need many documents to assert this, but don't use one
  document either
  +int numDocs = atLeast(random, 50);
  +for (int i = 0; i  numDocs; i++) {
  +  Document doc = new Document();
  +  doc.add(new StringField(f, doc, Store.NO));
  +  writer.addDocument(doc);
  +}
  +writer.close();
  +
  +Term term = new Term(f, new BytesRef(doc));
  +DirectoryReader reader = DirectoryReader.open(dir);
  +for (AtomicReaderContext ctx : reader.leaves()) {
  +  DocsEnum de = ctx.reader().termDocsEnum(term);
  +  while (de.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
  +assertEquals(wrong freq for doc  + de.docID(), 1, de.freq());
  +  }
  +}
  +reader.close();
  +
  +dir.close();
  +  }
   }
 
  BTW, I don't know if it should be like that or not, but I ran this test
  with
  -Dtests.iters=1000 and printed at the end of each iteration
  Codec.getDefault().
  For all iterations it printed the same Codec, but if I ran the test many
  times (no iteration), it printed different Codecs in each run.
  I thought that tests.iters should simulate the behavior of running the
  test
  many times? And therefore pick a new seed + Codec (among other things)
  for
  each iteration?
 
  Shai
 
 
  On Tue, Dec 18, 2012 at 1:09 PM, Michael McCandless
  luc...@mikemccandless.com wrote:
 
  On Tue, Dec 18, 2012 at 4:46 AM, Shai Erera ser...@gmail.com wrote:
   Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY?
   Do
   we
   have a test that can trip bad Codecs?
 
  I'm not sure!  We should make a test  fix any failing ones

Re: DocsEnum.freq()

2012-12-19 Thread Dawid Weiss
 BTW, I don't know if it should be like that or not, but I ran this test with
 -Dtests.iters=1000 and printed at the end of each iteration
 Codec.getDefault().
 For all iterations it printed the same Codec, but if I ran the test many
 times (no iteration), it printed different Codecs in each run.
 I thought that tests.iters should simulate the behavior of running the test
 many times? And therefore pick a new seed + Codec (among other things) for
 each iteration?

Sorry, I missed this. What you're observing is correct -- the Codec is
picked at the class level (in TestRuleSetupAndRestoreClassEnv.java).
-Dtests.iters duplicates each test much like if you put two or more
test methods in a suite class. Any static rules and static hooks
execute *once* for a class, regardless of the number of individual
tests, so the codec is picked once, depending on the master seed (not
the per-method derivative seed).

This would be of course solved if tests.dups could run each class with
a different seed. I wish I had more time to look into this; I kind of
know how to solve this I think but it'll require a significant effort
put into rewriting core parts of the runner. Also, this will only work
if you run it from command-line (an ant task); there is no way to
duplicate an entire test suite in Eclipse or other IDEs.

I'll try to look into this, be patient. Apologies for the
inconvenience it causes.

Dawid

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: DocsEnum.freq()

2012-12-18 Thread Shai Erera
Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY? Do we
have a test that can trip bad Codecs?
It may be more than just changing the documentation...

Why would e.g. TermQuery need to write specialized code for these cases? I
looked at TermScorer, and its freq() just returns docsEnum.freq().

I think that Similarity may be affected? Which brings the question - how do
Similarity impls know what flags the DE was opened with, and shouldn't they
be specialized?
E.g. TFIDFSimilarity.ExactTFIDFDocScorer uses the freq passed to score() as
an index to an array, so clearly it assumes it is = 0 and also 
scoreCache.length.
So I wonder what will happen to it when someone's Codec will return a
negative value or MAX_INT in case frequencies aren't needed?

I do realize that you shouldn't call Similarity with missing information,
and TermWeight obtains a DocsEnum with frequencies, so in that regard it is
safe.
And if you do obtain a DocsEnum with FLAG_NONE, you'd better know what
you're doing and don't pass a random freq() to Similarity.

I lean towards documenting the spec from above, and ensuring that all
Codecs return 1 for DOCS_ONLY.

If in the future we'll need to handle the case where someone receives a
DocsEnum which it needs to consume, and doesn't know which flags were used
to open it, we can always add a getFlags to DE.

Shai


On Mon, Dec 17, 2012 at 11:30 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 On Mon, Dec 17, 2012 at 4:02 PM, Shai Erera ser...@gmail.com wrote:
  How do these two go together?
 
  I think for DOCS_ONLY it makes sense that we lie (say freq=1 when we
  don't know): lots of places would otherwise have to be special cased
  for when they consume DOCS_ONLY vs DOCS_AND_POSITIONS.
 
 
  and
 
  I'm also not sure that
  all codecs return 1 today if the fields was indexed with DOCS_ONLY ...
 
 
   That just makes it even worse right? I.e., we have code today that
 relies
  no that behavior, but we're not sure it works w/ all Codecs?

 Sorry, for my last sentence above I think I meant I'm also not sure
 that all codecs return 1 today if you ask for FLAG_NONE.

  Remember that DocIdSetIterator.nextDoc() was loosely specified? It was
 very
  hard to write a decent DISI consumer. Sometimes calling nextDoc()
 returned
  MAX_VAL, sometimes -1, sometimes who knows. When we hardened the spec, it
  actually made consumers' life easier, I think?

 Right, locking down the API makes total sense in general.

  It's ok if we say that for DOCS_ONLY you have to return 1. That's even
 99.9%
  of the time the correct value to return (unless someone adds e.g. the
 same
  StringField twice to the document).

 Right.

  And it's also ok to say that if you passed FLAG_NONE, freq()'s value is
  unspecified. I think it would be wrong to lie here .. not sure if the
  consumer always knows how DocsEnum was requested. Not sure if this
 happens
  in real life though (consuming a DocsEnum that you didn't obtain
 yourself),
  so I'm willing to ignore that case.

 +1: I think FLAG_NONE should remain undefined.  I think we have codecs
 today that will return 1, 0, the actual doc freq (when the field was
 indexed as = DOCS_AND_FREQS).

  These two together sound like a reasonable spec to me?

 +1

 So I think just change your javadocs patch to say that FLAG_NONE means
 freq is not defined, and if field was indexed as DOCS_ONLY and you
 asked for FLAG_FREQS then we promise to lie and say freq=1?

 Mike McCandless

 http://blog.mikemccandless.com

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




Re: DocsEnum.freq()

2012-12-18 Thread Michael McCandless
On Tue, Dec 18, 2012 at 4:46 AM, Shai Erera ser...@gmail.com wrote:
 Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY? Do we
 have a test that can trip bad Codecs?

I'm not sure!  We should make a test  fix any failing ones ...

 It may be more than just changing the documentation...

Right.

 Why would e.g. TermQuery need to write specialized code for these cases? I
 looked at TermScorer, and its freq() just returns docsEnum.freq().

I meant if we did not adopt this spec (freq() will lie and return 1
when the field was indexed as DOCS_ONLY), then e.g. TermQuery would
need specialized code.

 I think that Similarity may be affected? Which brings the question - how do
 Similarity impls know what flags the DE was opened with, and shouldn't they
 be specialized?
 E.g. TFIDFSimilarity.ExactTFIDFDocScorer uses the freq passed to score() as
 an index to an array, so clearly it assumes it is = 0 and also 
 scoreCache.length.
 So I wonder what will happen to it when someone's Codec will return a
 negative value or MAX_INT in case frequencies aren't needed?

Well, if you passed FLAGS_NONE when you opened the DE then it's your
responsibility to never call freq() ... ie, don't call freq() and pass
that to the sim.

 I do realize that you shouldn't call Similarity with missing information,
 and TermWeight obtains a DocsEnum with frequencies, so in that regard it is
 safe.
 And if you do obtain a DocsEnum with FLAG_NONE, you'd better know what
 you're doing and don't pass a random freq() to Similarity.

Right.

 I lean towards documenting the spec from above, and ensuring that all Codecs
 return 1 for DOCS_ONLY.

+1

So freq() is undefined if you had passed FLAGS_NONE, and we will lie
and say freq=1 (need a test verifying this) if the field was indexed
as DOCS_ONLY.

 If in the future we'll need to handle the case where someone receives a
 DocsEnum which it needs to consume, and doesn't know which flags were used
 to open it, we can always add a getFlags to DE.

Yeah ...

Mike McCandless

http://blog.mikemccandless.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



DocsEnum.freq()

2012-12-17 Thread Shai Erera
Hi

While migrating code to Lucene 4.0, I noticed that I have an assert on a
field that is indexed with DOCS_ONLY that DocsEnum.freq() == 1. This got me
thinking ... why?

If you index w/ DOCS_ONLY, or ask for DocsEnum with FLAG_NONE, why do we
lie to the consumer? Rather, we could just return 0 or -1?

I personally don't mind if we continue to return 1, if there's a real
reason to. I don't think that anyone should call freq() if he asked for
DocsEnum with FLAG_NONE. But if we do keep the current behavior, can we at
least document it?

E.g., something like this patch:

Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java
===
--- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (revision
1422804)
+++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (working
copy)
@@ -47,10 +47,16 @@
   protected DocsEnum() {
   }

-  /** Returns term frequency in the current document.  Do
-   *  not call this before {@link #nextDoc} is first called,
-   *  nor after {@link #nextDoc} returns NO_MORE_DOCS.
-   **/
+  /**
+   * Returns term frequency in the current document, or 1 if the
+   * {@link DocsEnum} was obtained with {@link #FLAG_NONE}. Do not call
this
+   * before {@link #nextDoc} is first called, nor after {@link #nextDoc}
returns
+   * {@link DocIdSetIterator#NO_MORE_DOCS}.
+   *
+   * p
+   * bNOTE:/b if the {@link DocsEnum} was obtain with {@link
#FLAG_NONE},
+   * this method returns 1.
+   */
   public abstract int freq() throws IOException;

   /** Returns the related attributes. */

Shai


Re: DocsEnum.freq()

2012-12-17 Thread Robert Muir
On Mon, Dec 17, 2012 at 11:24 AM, Shai Erera ser...@gmail.com wrote:
 Hi

 While migrating code to Lucene 4.0, I noticed that I have an assert on a
 field that is indexed with DOCS_ONLY that DocsEnum.freq() == 1. This got me
 thinking ... why?

1 is pretty intuitive i think. i think of omitting freqs for things
like super-short fields where the frequency is expected to be 1
anyway.

I see your point about lying, but if we change this, we have to
change all code using it like TermQuery to not call freqs if they are
unavailable: i think this makes the apis pretty difficult to use.


 But if we do keep the current behavior, can we at least
 document it?

 E.g., something like this patch:


+1 to the patch.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: DocsEnum.freq()

2012-12-17 Thread Michael McCandless
I think the FLAG_NONE (I don't need/want freqs when reading the
index) and the DOCS_ONLY (Do not index freqs) are two different
cases?

I think for DOCS_ONLY it makes sense that we lie (say freq=1 when we
don't know): lots of places would otherwise have to be special cased
for when they consume DOCS_ONLY vs DOCS_AND_POSITIONS.

But, for FLAG_NONE, when the caller passes this it means they have no
intention of using/calling freq() right?  Eg
MultiTermQueryWrapperFilter would pass this.  For that case I'm not
sure we should promise / require that codecs return 1 always?  EG what
if the index does has freqs?  I think in that case the codec shouldn't
be required to go out of its way and return 1?  I'm also not sure that
all codecs return 1 today if the fields was indexed with DOCS_ONLY ...

Mike McCandless

http://blog.mikemccandless.com

On Mon, Dec 17, 2012 at 11:24 AM, Shai Erera ser...@gmail.com wrote:
 Hi

 While migrating code to Lucene 4.0, I noticed that I have an assert on a
 field that is indexed with DOCS_ONLY that DocsEnum.freq() == 1. This got me
 thinking ... why?

 If you index w/ DOCS_ONLY, or ask for DocsEnum with FLAG_NONE, why do we
 lie to the consumer? Rather, we could just return 0 or -1?

 I personally don't mind if we continue to return 1, if there's a real reason
 to. I don't think that anyone should call freq() if he asked for DocsEnum
 with FLAG_NONE. But if we do keep the current behavior, can we at least
 document it?

 E.g., something like this patch:

 Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java
 ===
 --- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (revision
 1422804)
 +++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (working
 copy)
 @@ -47,10 +47,16 @@
protected DocsEnum() {
}

 -  /** Returns term frequency in the current document.  Do
 -   *  not call this before {@link #nextDoc} is first called,
 -   *  nor after {@link #nextDoc} returns NO_MORE_DOCS.
 -   **/
 +  /**
 +   * Returns term frequency in the current document, or 1 if the
 +   * {@link DocsEnum} was obtained with {@link #FLAG_NONE}. Do not call
 this
 +   * before {@link #nextDoc} is first called, nor after {@link #nextDoc}
 returns
 +   * {@link DocIdSetIterator#NO_MORE_DOCS}.
 +   *
 +   * p
 +   * bNOTE:/b if the {@link DocsEnum} was obtain with {@link
 #FLAG_NONE},
 +   * this method returns 1.
 +   */
public abstract int freq() throws IOException;

/** Returns the related attributes. */

 Shai

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: DocsEnum.freq()

2012-12-17 Thread Shai Erera
How do these two go together?

I think for DOCS_ONLY it makes sense that we lie (say freq=1 when we
 don't know): lots of places would otherwise have to be special cased
 for when they consume DOCS_ONLY vs DOCS_AND_POSITIONS.


and

I'm also not sure that
 all codecs return 1 today if the fields was indexed with DOCS_ONLY ...


 That just makes it even worse right? I.e., we have code today that relies
no that behavior, but we're not sure it works w/ all Codecs?

Remember that DocIdSetIterator.nextDoc() was loosely specified? It was very
hard to write a decent DISI consumer. Sometimes calling nextDoc() returned
MAX_VAL, sometimes -1, sometimes who knows. When we hardened the spec, it
actually made consumers' life easier, I think?

It's ok if we say that for DOCS_ONLY you have to return 1. That's even
99.9% of the time the correct value to return (unless someone adds e.g. the
same StringField twice to the document).

And it's also ok to say that if you passed FLAG_NONE, freq()'s value is
unspecified. I think it would be wrong to lie here .. not sure if the
consumer always knows how DocsEnum was requested. Not sure if this happens
in real life though (consuming a DocsEnum that you didn't obtain yourself),
so I'm willing to ignore that case.

These two together sound like a reasonable spec to me?

Shai


On Mon, Dec 17, 2012 at 7:16 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 I think the FLAG_NONE (I don't need/want freqs when reading the
 index) and the DOCS_ONLY (Do not index freqs) are two different
 cases?

 I think for DOCS_ONLY it makes sense that we lie (say freq=1 when we
 don't know): lots of places would otherwise have to be special cased
 for when they consume DOCS_ONLY vs DOCS_AND_POSITIONS.

 But, for FLAG_NONE, when the caller passes this it means they have no
 intention of using/calling freq() right?  Eg
 MultiTermQueryWrapperFilter would pass this.  For that case I'm not
 sure we should promise / require that codecs return 1 always?  EG what
 if the index does has freqs?  I think in that case the codec shouldn't
 be required to go out of its way and return 1?  I'm also not sure that
 all codecs return 1 today if the fields was indexed with DOCS_ONLY ...

 Mike McCandless

 http://blog.mikemccandless.com

 On Mon, Dec 17, 2012 at 11:24 AM, Shai Erera ser...@gmail.com wrote:
  Hi
 
  While migrating code to Lucene 4.0, I noticed that I have an assert on a
  field that is indexed with DOCS_ONLY that DocsEnum.freq() == 1. This got
 me
  thinking ... why?
 
  If you index w/ DOCS_ONLY, or ask for DocsEnum with FLAG_NONE, why do we
  lie to the consumer? Rather, we could just return 0 or -1?
 
  I personally don't mind if we continue to return 1, if there's a real
 reason
  to. I don't think that anyone should call freq() if he asked for DocsEnum
  with FLAG_NONE. But if we do keep the current behavior, can we at least
  document it?
 
  E.g., something like this patch:
 
  Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java
  ===
  --- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (revision
  1422804)
  +++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (working
  copy)
  @@ -47,10 +47,16 @@
 protected DocsEnum() {
 }
 
  -  /** Returns term frequency in the current document.  Do
  -   *  not call this before {@link #nextDoc} is first called,
  -   *  nor after {@link #nextDoc} returns NO_MORE_DOCS.
  -   **/
  +  /**
  +   * Returns term frequency in the current document, or 1 if the
  +   * {@link DocsEnum} was obtained with {@link #FLAG_NONE}. Do not call
  this
  +   * before {@link #nextDoc} is first called, nor after {@link #nextDoc}
  returns
  +   * {@link DocIdSetIterator#NO_MORE_DOCS}.
  +   *
  +   * p
  +   * bNOTE:/b if the {@link DocsEnum} was obtain with {@link
  #FLAG_NONE},
  +   * this method returns 1.
  +   */
 public abstract int freq() throws IOException;
 
 /** Returns the related attributes. */
 
  Shai

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




Re: DocsEnum.freq()

2012-12-17 Thread Michael McCandless
On Mon, Dec 17, 2012 at 4:02 PM, Shai Erera ser...@gmail.com wrote:
 How do these two go together?

 I think for DOCS_ONLY it makes sense that we lie (say freq=1 when we
 don't know): lots of places would otherwise have to be special cased
 for when they consume DOCS_ONLY vs DOCS_AND_POSITIONS.


 and

 I'm also not sure that
 all codecs return 1 today if the fields was indexed with DOCS_ONLY ...


  That just makes it even worse right? I.e., we have code today that relies
 no that behavior, but we're not sure it works w/ all Codecs?

Sorry, for my last sentence above I think I meant I'm also not sure
that all codecs return 1 today if you ask for FLAG_NONE.

 Remember that DocIdSetIterator.nextDoc() was loosely specified? It was very
 hard to write a decent DISI consumer. Sometimes calling nextDoc() returned
 MAX_VAL, sometimes -1, sometimes who knows. When we hardened the spec, it
 actually made consumers' life easier, I think?

Right, locking down the API makes total sense in general.

 It's ok if we say that for DOCS_ONLY you have to return 1. That's even 99.9%
 of the time the correct value to return (unless someone adds e.g. the same
 StringField twice to the document).

Right.

 And it's also ok to say that if you passed FLAG_NONE, freq()'s value is
 unspecified. I think it would be wrong to lie here .. not sure if the
 consumer always knows how DocsEnum was requested. Not sure if this happens
 in real life though (consuming a DocsEnum that you didn't obtain yourself),
 so I'm willing to ignore that case.

+1: I think FLAG_NONE should remain undefined.  I think we have codecs
today that will return 1, 0, the actual doc freq (when the field was
indexed as = DOCS_AND_FREQS).

 These two together sound like a reasonable spec to me?

+1

So I think just change your javadocs patch to say that FLAG_NONE means
freq is not defined, and if field was indexed as DOCS_ONLY and you
asked for FLAG_FREQS then we promise to lie and say freq=1?

Mike McCandless

http://blog.mikemccandless.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org