4.0 and 4.1 FieldCacheImpl.DocTermsImpl.exists(docid) possibly broken

2013-07-18 Thread Doron Cohen
Hi, just an FYI - may be helpful for anyone obliged to use 4.0.0 or 4.1.0 -
it seems that this method is actually doing the opposite of its intention.

I did not find mentions of this in the lists or elsewhere.

This is the code for o.a.l.search.FieldCacheImpl.DocTermsImpl.exists(int):
public boolean exists(int docID) {
  return docToOffset.get(docID) == 0;
}

Its description says: Returns true if this doc has this field and is not
deleted.
But it returns true for docs not containing the field and false for those
that do contain it.

A simple workaround is to not to call this method before calling getTerm()
but rather just rely on getTerm()  logic: ... returns the same BytesRef,
or an empty (length=0) BytesRef if the doc did not have this field or was
deleted.

So usage code can be like this:
DocTerms values =  FieldCache.DEFAULT.getTerms(reader, FIELD_NAME);
 BytesRef term = new BytesRef();
for (int docid=0; docidvalues.size(); docid++) {
term = values.getTerm(docid, term);
 if (term.length0) {
doSomethingWith(term.utf8ToString());
}
 }
FieldCache.DEFAULT.purge(reader);

I am not sure about the overhead of this comparing to first checking
exists(), but it at least work correctly.

The code for exists() was as above until R1442497 (
http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/FieldCacheImpl.java?revision=1442497view=markup)
and then in R1443717 (
http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/FieldCacheImpl.java?r1=1442497r2=1443717diff_format=h)
the API was change as part of LUCENE-4547 (DocValues improvements) which
was included in 4.2.

Simple code to demonstrate this (here with 4.1 but same results with 4.0):

RAMDirectory d = new RAMDirectory();
IndexWriter w = new IndexWriter(d, new IndexWriterConfig(Version.LUCENE_41,
new SimpleAnalyzer(Version.LUCENE_41)));
 w.addDocument(new Document()); // Empty doc (0, 0)
 Document doc = new Document(); // Real doc (1, 1)
doc.add(new StringField(f1, v1, Store.NO));
w.addDocument(doc);
 w.addDocument(new Document()); // Empty doc (2, 2)
w.addDocument(new Document()); // Empty doc (3, 3)
w.commit(); // Commit - so we'll have two atomic readers
 doc = new Document(); // RealDoc (0, 4)
doc.add(new StringField(f1, v2, Store.NO));
w.addDocument(doc);
w.addDocument(new Document()); // Empty doc (1, 5)
w.close();

IndexReader r = DirectoryReader.open(d);
BytesRef br = new BytesRef();
for (AtomicReaderContext leaf : r.leaves()) {
System.out.println(--- new atomic reader);
AtomicReader reader = leaf.reader();
DocTerms a = FieldCache.DEFAULT.getTerms(reader, f1);
for (int i = 0; i  reader.maxDoc(); ++i) {
int n = leaf.docBase + i;
System.out.println(n+ exists: +a.exists(i));
br = a.getTerm(i, br);
if (br.length  0) {
System.out.println(n +   +br.utf8ToString());
}
}
}

The result printing:

  --- new atomic reader
  0 exists: true
  1 exists: false
  1  v1
  2 exists: true
  3 exists: true
  --- new atomic reader
 4 exists: false
 4  v2
 5 exists: true

Indeed, exists() results are wrong.

So again, just an FYI, as this API no longer exists...

Regards,
Doron


Re: 4.0 and 4.1 FieldCacheImpl.DocTermsImpl.exists(docid) possibly broken

2013-07-18 Thread Michael McCandless
Thanks Doron, that's definitely completely backwards!!

Good thing the API is gone.


Mike McCandless

http://blog.mikemccandless.com


On Thu, Jul 18, 2013 at 7:50 AM, Doron Cohen cdor...@gmail.com wrote:
 Hi, just an FYI - may be helpful for anyone obliged to use 4.0.0 or 4.1.0 -
 it seems that this method is actually doing the opposite of its intention.

 I did not find mentions of this in the lists or elsewhere.

 This is the code for o.a.l.search.FieldCacheImpl.DocTermsImpl.exists(int):
 public boolean exists(int docID) {
   return docToOffset.get(docID) == 0;
 }

 Its description says: Returns true if this doc has this field and is not
 deleted.
 But it returns true for docs not containing the field and false for those
 that do contain it.

 A simple workaround is to not to call this method before calling getTerm()
 but rather just rely on getTerm()  logic: ... returns the same BytesRef, or
 an empty (length=0) BytesRef if the doc did not have this field or was
 deleted.

 So usage code can be like this:
 DocTerms values =  FieldCache.DEFAULT.getTerms(reader, FIELD_NAME);
 BytesRef term = new BytesRef();
 for (int docid=0; docidvalues.size(); docid++) {
 term = values.getTerm(docid, term);
 if (term.length0) {
 doSomethingWith(term.utf8ToString());
 }
 }
 FieldCache.DEFAULT.purge(reader);

 I am not sure about the overhead of this comparing to first checking
 exists(), but it at least work correctly.

 The code for exists() was as above until R1442497
 (http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/FieldCacheImpl.java?revision=1442497view=markup)
 and then in R1443717
 (http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/FieldCacheImpl.java?r1=1442497r2=1443717diff_format=h)
 the API was change as part of LUCENE-4547 (DocValues improvements) which was
 included in 4.2.

 Simple code to demonstrate this (here with 4.1 but same results with 4.0):

 RAMDirectory d = new RAMDirectory();
 IndexWriter w = new IndexWriter(d, new IndexWriterConfig(Version.LUCENE_41,
 new SimpleAnalyzer(Version.LUCENE_41)));
 w.addDocument(new Document()); // Empty doc (0, 0)
 Document doc = new Document(); // Real doc (1, 1)
 doc.add(new StringField(f1, v1, Store.NO));
 w.addDocument(doc);
 w.addDocument(new Document()); // Empty doc (2, 2)
 w.addDocument(new Document()); // Empty doc (3, 3)
 w.commit(); // Commit - so we'll have two atomic readers
 doc = new Document(); // RealDoc (0, 4)
 doc.add(new StringField(f1, v2, Store.NO));
 w.addDocument(doc);
 w.addDocument(new Document()); // Empty doc (1, 5)
 w.close();

 IndexReader r = DirectoryReader.open(d);
 BytesRef br = new BytesRef();
 for (AtomicReaderContext leaf : r.leaves()) {
 System.out.println(--- new atomic reader);
 AtomicReader reader = leaf.reader();
 DocTerms a = FieldCache.DEFAULT.getTerms(reader, f1);
 for (int i = 0; i  reader.maxDoc(); ++i) {
 int n = leaf.docBase + i;
 System.out.println(n+ exists: +a.exists(i));
 br = a.getTerm(i, br);
 if (br.length  0) {
 System.out.println(n +   +br.utf8ToString());
 }
 }
 }

 The result printing:

   --- new atomic reader
   0 exists: true
   1 exists: false
   1  v1
   2 exists: true
   3 exists: true
   --- new atomic reader
  4 exists: false
  4  v2
  5 exists: true

 Indeed, exists() results are wrong.

 So again, just an FYI, as this API no longer exists...

 Regards,
 Doron

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org