+1 On Wed, Dec 19, 2012 at 7:42 AM, Shai Erera <ser...@gmail.com> wrote: > Thanks for catching that Mike. Will change and commit. > > Shai > > > On Wed, Dec 19, 2012 at 2:38 PM, Michael McCandless > <luc...@mikemccandless.com> wrote: >> >> +1, but can you change "... if the Document was indexed with >> DOCS_ONLY" to "... if the Field was indexed with DOCS_ONLY"? >> >> Thanks Shai! >> >> Mike McCandless >> >> http://blog.mikemccandless.com >> >> On Wed, Dec 19, 2012 at 3:14 AM, Shai Erera <ser...@gmail.com> wrote: >> > Here's the patch with test: >> > >> > Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java >> > =================================================================== >> > --- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java >> > (revision >> > 1423774) >> > +++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java (working >> > copy) >> > @@ -19,6 +19,7 @@ >> > >> > import java.io.IOException; >> > >> > +import org.apache.lucene.index.FieldInfo.IndexOptions; >> > import org.apache.lucene.search.DocIdSetIterator; >> > import org.apache.lucene.util.AttributeSource; >> > import org.apache.lucene.util.Bits; // javadocs >> > @@ -47,10 +48,16 @@ >> > >> > protected DocsEnum() { >> > } >> > >> > - /** Returns term frequency in the current document. Do >> > - * not call this before {@link #nextDoc} is first called, >> > - * nor after {@link #nextDoc} returns NO_MORE_DOCS. >> > - **/ >> > + /** >> > + * Returns term frequency in the current document, or 1 if the >> > document >> > was >> > + * indexed with {@link IndexOptions#DOCS_ONLY}. Do not call this >> > before >> > + * {@link #nextDoc} is first called, nor after {@link #nextDoc} >> > returns >> > >> > + * {@link DocIdSetIterator#NO_MORE_DOCS}. >> > + * >> > + * <p> >> > + * <b>NOTE:</b> if the {@link DocsEnum} was obtain with {@link >> > #FLAG_NONE}, >> > + * the result of this method is undefined. >> > >> > + */ >> > public abstract int freq() throws IOException; >> > >> > /** Returns the related attributes. */ >> > Index: lucene/core/src/test/org/apache/lucene/index/TestCodecs.java >> > =================================================================== >> > --- lucene/core/src/test/org/apache/lucene/index/TestCodecs.java >> > (revision 1423774) >> > +++ lucene/core/src/test/org/apache/lucene/index/TestCodecs.java >> > (working copy) >> > @@ -21,6 +21,7 @@ >> > import java.util.Arrays; >> > import java.util.HashSet; >> > import java.util.Iterator; >> > +import java.util.Random; >> > >> > import org.apache.lucene.analysis.MockAnalyzer; >> > import org.apache.lucene.codecs.Codec; >> > @@ -31,7 +32,9 @@ >> > import org.apache.lucene.codecs.TermsConsumer; >> > import org.apache.lucene.codecs.mocksep.MockSepPostingsFormat; >> > import org.apache.lucene.document.Document; >> > +import org.apache.lucene.document.Field.Store; >> > import org.apache.lucene.document.FieldType; >> > +import org.apache.lucene.document.StringField; >> > import org.apache.lucene.document.TextField; >> > import org.apache.lucene.index.FieldInfo.IndexOptions; >> > import org.apache.lucene.search.DocIdSetIterator; >> > @@ -630,4 +633,33 @@ >> > } >> > consumer.close(); >> > } >> > + >> > + public void testDocsOnlyFreq() throws Exception { >> > + // tests that when fields are indexed with DOCS_ONLY, the Codec >> > + // returns 1 in docsEnum.freq() >> > + Directory dir = newDirectory(); >> > + Random random = random(); >> > + IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig( >> > + TEST_VERSION_CURRENT, new MockAnalyzer(random))); >> > + // we don't need many documents to assert this, but don't use one >> > document either >> > + int numDocs = atLeast(random, 50); >> > + for (int i = 0; i < numDocs; i++) { >> > + Document doc = new Document(); >> > + doc.add(new StringField("f", "doc", Store.NO)); >> > + writer.addDocument(doc); >> > + } >> > + writer.close(); >> > + >> > + Term term = new Term("f", new BytesRef("doc")); >> > + DirectoryReader reader = DirectoryReader.open(dir); >> > + for (AtomicReaderContext ctx : reader.leaves()) { >> > + DocsEnum de = ctx.reader().termDocsEnum(term); >> > + while (de.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) { >> > + assertEquals("wrong freq for doc " + de.docID(), 1, de.freq()); >> > + } >> > + } >> > + reader.close(); >> > + >> > + dir.close(); >> > + } >> > } >> > >> > BTW, I don't know if it should be like that or not, but I ran this test >> > with >> > -Dtests.iters=1000 and printed at the end of each iteration >> > Codec.getDefault(). >> > For all iterations it printed the same Codec, but if I ran the test many >> > times (no iteration), it printed different Codecs in each run. >> > I thought that tests.iters should simulate the behavior of running the >> > test >> > many times? And therefore pick a new seed + Codec (among other things) >> > for >> > each iteration? >> > >> > Shai >> > >> > >> > On Tue, Dec 18, 2012 at 1:09 PM, Michael McCandless >> > <luc...@mikemccandless.com> wrote: >> >> >> >> On Tue, Dec 18, 2012 at 4:46 AM, Shai Erera <ser...@gmail.com> wrote: >> >> > Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY? >> >> > Do >> >> > we >> >> > have a test that can trip bad Codecs? >> >> >> >> I'm not sure! We should make a test & fix any failing ones ... >> >> >> >> > It may be more than just changing the documentation... >> >> >> >> Right. >> >> >> >> > Why would e.g. TermQuery need to write specialized code for these >> >> > cases? >> >> > I >> >> > looked at TermScorer, and its freq() just returns docsEnum.freq(). >> >> >> >> I meant if we did not adopt this spec ("freq() will lie and return 1 >> >> when the field was indexed as DOCS_ONLY"), then e.g. TermQuery would >> >> need specialized code. >> >> >> >> > I think that Similarity may be affected? Which brings the question - >> >> > how >> >> > do >> >> > Similarity impls know what flags the DE was opened with, and >> >> > shouldn't >> >> > they >> >> > be specialized? >> >> > E.g. TFIDFSimilarity.ExactTFIDFDocScorer uses the freq passed to >> >> > score() >> >> > as >> >> > an index to an array, so clearly it assumes it is >= 0 and also < >> >> > scoreCache.length. >> >> > So I wonder what will happen to it when someone's Codec will return a >> >> > negative value or MAX_INT in case frequencies aren't needed? >> >> >> >> Well, if you passed FLAGS_NONE when you opened the DE then it's your >> >> responsibility to never call freq() ... ie, don't call freq() and pass >> >> that to the sim. >> >> >> >> > I do realize that you shouldn't call Similarity with missing >> >> > information, >> >> > and TermWeight obtains a DocsEnum with frequencies, so in that regard >> >> > it >> >> > is >> >> > safe. >> >> > And if you do obtain a DocsEnum with FLAG_NONE, you'd better know >> >> > what >> >> > you're doing and don't pass a random freq() to Similarity. >> >> >> >> Right. >> >> >> >> > I lean towards documenting the spec from above, and ensuring that all >> >> > Codecs >> >> > return 1 for DOCS_ONLY. >> >> >> >> +1 >> >> >> >> So freq() is undefined if you had passed FLAGS_NONE, and we will lie >> >> and say freq=1 (need a test verifying this) if the field was indexed >> >> as DOCS_ONLY. >> >> >> >> > If in the future we'll need to handle the case where someone receives >> >> > a >> >> > DocsEnum which it needs to consume, and doesn't know which flags were >> >> > used >> >> > to open it, we can always add a getFlags to DE. >> >> >> >> Yeah ... >> >> >> >> Mike McCandless >> >> >> >> http://blog.mikemccandless.com >> >> >> >> --------------------------------------------------------------------- >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> >> For additional commands, e-mail: dev-h...@lucene.apache.org >> >> >> > >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> For additional commands, e-mail: dev-h...@lucene.apache.org >> >
--------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org