Thanks for catching that Mike. Will change and commit. Shai
On Wed, Dec 19, 2012 at 2:38 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > +1, but can you change "... if the Document was indexed with > DOCS_ONLY" to "... if the Field was indexed with DOCS_ONLY"? > > Thanks Shai! > > Mike McCandless > > http://blog.mikemccandless.com > > On Wed, Dec 19, 2012 at 3:14 AM, Shai Erera <ser...@gmail.com> wrote: > > Here's the patch with test: > > > > Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java > > =================================================================== > > --- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java (revision > > 1423774) > > +++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java (working > > copy) > > @@ -19,6 +19,7 @@ > > > > import java.io.IOException; > > > > +import org.apache.lucene.index.FieldInfo.IndexOptions; > > import org.apache.lucene.search.DocIdSetIterator; > > import org.apache.lucene.util.AttributeSource; > > import org.apache.lucene.util.Bits; // javadocs > > @@ -47,10 +48,16 @@ > > > > protected DocsEnum() { > > } > > > > - /** Returns term frequency in the current document. Do > > - * not call this before {@link #nextDoc} is first called, > > - * nor after {@link #nextDoc} returns NO_MORE_DOCS. > > - **/ > > + /** > > + * Returns term frequency in the current document, or 1 if the > document > > was > > + * indexed with {@link IndexOptions#DOCS_ONLY}. Do not call this > before > > + * {@link #nextDoc} is first called, nor after {@link #nextDoc} > returns > > > > + * {@link DocIdSetIterator#NO_MORE_DOCS}. > > + * > > + * <p> > > + * <b>NOTE:</b> if the {@link DocsEnum} was obtain with {@link > > #FLAG_NONE}, > > + * the result of this method is undefined. > > > > + */ > > public abstract int freq() throws IOException; > > > > /** Returns the related attributes. */ > > Index: lucene/core/src/test/org/apache/lucene/index/TestCodecs.java > > =================================================================== > > --- lucene/core/src/test/org/apache/lucene/index/TestCodecs.java > > (revision 1423774) > > +++ lucene/core/src/test/org/apache/lucene/index/TestCodecs.java > > (working copy) > > @@ -21,6 +21,7 @@ > > import java.util.Arrays; > > import java.util.HashSet; > > import java.util.Iterator; > > +import java.util.Random; > > > > import org.apache.lucene.analysis.MockAnalyzer; > > import org.apache.lucene.codecs.Codec; > > @@ -31,7 +32,9 @@ > > import org.apache.lucene.codecs.TermsConsumer; > > import org.apache.lucene.codecs.mocksep.MockSepPostingsFormat; > > import org.apache.lucene.document.Document; > > +import org.apache.lucene.document.Field.Store; > > import org.apache.lucene.document.FieldType; > > +import org.apache.lucene.document.StringField; > > import org.apache.lucene.document.TextField; > > import org.apache.lucene.index.FieldInfo.IndexOptions; > > import org.apache.lucene.search.DocIdSetIterator; > > @@ -630,4 +633,33 @@ > > } > > consumer.close(); > > } > > + > > + public void testDocsOnlyFreq() throws Exception { > > + // tests that when fields are indexed with DOCS_ONLY, the Codec > > + // returns 1 in docsEnum.freq() > > + Directory dir = newDirectory(); > > + Random random = random(); > > + IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig( > > + TEST_VERSION_CURRENT, new MockAnalyzer(random))); > > + // we don't need many documents to assert this, but don't use one > > document either > > + int numDocs = atLeast(random, 50); > > + for (int i = 0; i < numDocs; i++) { > > + Document doc = new Document(); > > + doc.add(new StringField("f", "doc", Store.NO)); > > + writer.addDocument(doc); > > + } > > + writer.close(); > > + > > + Term term = new Term("f", new BytesRef("doc")); > > + DirectoryReader reader = DirectoryReader.open(dir); > > + for (AtomicReaderContext ctx : reader.leaves()) { > > + DocsEnum de = ctx.reader().termDocsEnum(term); > > + while (de.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) { > > + assertEquals("wrong freq for doc " + de.docID(), 1, de.freq()); > > + } > > + } > > + reader.close(); > > + > > + dir.close(); > > + } > > } > > > > BTW, I don't know if it should be like that or not, but I ran this test > with > > -Dtests.iters=1000 and printed at the end of each iteration > > Codec.getDefault(). > > For all iterations it printed the same Codec, but if I ran the test many > > times (no iteration), it printed different Codecs in each run. > > I thought that tests.iters should simulate the behavior of running the > test > > many times? And therefore pick a new seed + Codec (among other things) > for > > each iteration? > > > > Shai > > > > > > On Tue, Dec 18, 2012 at 1:09 PM, Michael McCandless > > <luc...@mikemccandless.com> wrote: > >> > >> On Tue, Dec 18, 2012 at 4:46 AM, Shai Erera <ser...@gmail.com> wrote: > >> > Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY? > Do > >> > we > >> > have a test that can trip bad Codecs? > >> > >> I'm not sure! We should make a test & fix any failing ones ... > >> > >> > It may be more than just changing the documentation... > >> > >> Right. > >> > >> > Why would e.g. TermQuery need to write specialized code for these > cases? > >> > I > >> > looked at TermScorer, and its freq() just returns docsEnum.freq(). > >> > >> I meant if we did not adopt this spec ("freq() will lie and return 1 > >> when the field was indexed as DOCS_ONLY"), then e.g. TermQuery would > >> need specialized code. > >> > >> > I think that Similarity may be affected? Which brings the question - > how > >> > do > >> > Similarity impls know what flags the DE was opened with, and shouldn't > >> > they > >> > be specialized? > >> > E.g. TFIDFSimilarity.ExactTFIDFDocScorer uses the freq passed to > score() > >> > as > >> > an index to an array, so clearly it assumes it is >= 0 and also < > >> > scoreCache.length. > >> > So I wonder what will happen to it when someone's Codec will return a > >> > negative value or MAX_INT in case frequencies aren't needed? > >> > >> Well, if you passed FLAGS_NONE when you opened the DE then it's your > >> responsibility to never call freq() ... ie, don't call freq() and pass > >> that to the sim. > >> > >> > I do realize that you shouldn't call Similarity with missing > >> > information, > >> > and TermWeight obtains a DocsEnum with frequencies, so in that regard > it > >> > is > >> > safe. > >> > And if you do obtain a DocsEnum with FLAG_NONE, you'd better know what > >> > you're doing and don't pass a random freq() to Similarity. > >> > >> Right. > >> > >> > I lean towards documenting the spec from above, and ensuring that all > >> > Codecs > >> > return 1 for DOCS_ONLY. > >> > >> +1 > >> > >> So freq() is undefined if you had passed FLAGS_NONE, and we will lie > >> and say freq=1 (need a test verifying this) if the field was indexed > >> as DOCS_ONLY. > >> > >> > If in the future we'll need to handle the case where someone receives > a > >> > DocsEnum which it needs to consume, and doesn't know which flags were > >> > used > >> > to open it, we can always add a getFlags to DE. > >> > >> Yeah ... > >> > >> Mike McCandless > >> > >> http://blog.mikemccandless.com > >> > >> --------------------------------------------------------------------- > >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > >> For additional commands, e-mail: dev-h...@lucene.apache.org > >> > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > >