Re: Search returning documents matching a NOT range

Ian Lea Mon, 08 Nov 2010 03:46:38 -0800

This does seem extremely odd.  David sent me a copy of his index and
I've played around with it and also written a self-contained RAM index
program, below, that shows the same problem, namely that if the second
index has 1000+ docs the one and only doc in the first index is
incorrectly matched if the search is done with a MultiSearcher.  In
answer to Uwe's question, it works correctly if use a single
IndexSearcher on top of a MultiReader.


Tests run with lucene-core-3.0.2.jar.

Snippet from program output:

Larger index with 999 docs
--- multi reader ---
Query: +author:aaa -pubdate:[aaa TO bbb]
MaxDocs: 1000
Hit count: 0
--- multi searcher ---
Query: +author:aaa -pubdate:[aaa TO bbb]
MaxDocs: 1000
Hit count: 0

Larger index with 1000 docs
--- multi reader ---
Query: +author:aaa -pubdate:[aaa TO bbb]
MaxDocs: 1001
Hit count: 0
--- multi searcher ---
Query: +author:aaa -pubdate:[aaa TO bbb]
MaxDocs: 1001
Hit count: 1
Docno: 0
author: /aaa/, indexed: true
pubdate: /abc/, indexed: true

-----------------------------------------------------------------------
package test;

import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.document.*;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;
import org.apache.lucene.util.Version;

public class LuceneTest8 {

    static public void main(String[] args) throws Exception {
        test(999);
        test(1000);
        test(1001);
    }


    static void test(int _max) throws Exception {
        System.out.printf("\n\nLarger index with %s docs\n", _max);
        Analyzer anl = new StandardAnalyzer(Version.LUCENE_30);
        Directory dir1 = loadIndex(anl, 1, "aaa", "abc");
        Directory dir2 = loadIndex(anl, _max, "zzz", "zzz");
        QueryParser qp = new QueryParser(Version.LUCENE_30, "author", anl);
        String qstr = "author:aaa AND NOT pubdate:[aaa TO bbb]";
        Query q = qp.parse(qstr);
        IndexReader ir1 = IndexReader.open(dir1);
        IndexReader ir2 = IndexReader.open(dir2);
        Searcher searcher1 = new IndexSearcher(ir1);
        Searcher searcher2 = new IndexSearcher(ir2);
        MultiReader mr = new MultiReader(ir1, ir2);
        Searcher searcherm1 = new IndexSearcher(mr);
        MultiSearcher searcherm2 = new MultiSearcher(searcher1, searcher2);
        search(q, searcher1, "small index");
        search(q, searcher2, "larger index");
        search(q, searcherm1, "multi reader");
        search(q, searcherm2, "multi searcher");
    }



    static Directory loadIndex(Analyzer _anl,
                               int _max,
                               String _author,
                               String _pd) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter iw = new IndexWriter(dir,
                                         _anl,
                                         true,
                                         IndexWriter.MaxFieldLength.UNLIMITED);
        for (int i = 0; i < _max; i++) {
            Document d = new Document();
            d.add(new Field("author", _author,
                            Field.Store.YES, Field.Index.ANALYZED));
            d.add(new Field("pubdate", _pd,
                            Field.Store.YES, Field.Index.ANALYZED));
            iw.addDocument(d);
        }
        iw.close();
        return dir;
    }


    static void search(Query _q,
                       Searcher _searcher,
                       String _what) throws Exception {
        System.out.printf("--- %s ---\n", _what);
        System.out.printf("Query: %s\n", _q.toString());
        System.out.printf("MaxDocs: %s\n", _searcher.maxDoc());
        TopDocs topDocs = _searcher.search(_q, 10);
        System.out.printf("Hit count: %s\n", topDocs.totalHits);
        for (int in = 0; in < topDocs.totalHits; in++) {
            int docno = topDocs.scoreDocs[in].doc;
            Document ldoc = _searcher.doc(docno);
            System.out.printf("Docno: %s\n", docno);
            for (Fieldable f : ldoc.getFields()) {
                System.out.printf("%s: /%s/, indexed: %s\n",
                                  f.name(), f.stringValue(), f.isIndexed());
            }
        }
    }
}


--
Ian.


On Mon, Nov 8, 2010 at 4:32 AM, Uwe Schindler <[email protected]> wrote:
> Does the same happen with a MultiReader on top of both indexes and using a
> single IndexSearcher on top of this MultiReader?
>
> P.S.: How about using NumericField?
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
>
>
>> -----Original Message-----
>> From: David Fertig [mailto:[email protected]]
>> Sent: Monday, November 08, 2010 4:21 AM
>> To: [email protected]
>> Subject: RE: Search returning documents matching a NOT range
>>
>> publish_date is a string, formatted as YYYYMMDD, so it string sorting
> should
>> work correctly for this field.
>>
>> The field is indexed as a keyword and the field's value is also stored.
>>
>> I have previously reviewed the terms and optimized the index with luke
>> 1.0.1 to make sure there was no index corruption. It is a very useful
> tool,
>> however it can only open 1 index at a time so I can't reproduce the issue
> with
>> it.
>>
>> At your suggestion I added code to enumerate all terms in the indexes and
>> there are no inconsistencies.
>>
>> The two fields being searched each only have 1 term in the first index (as
>> expected) and those terms are not in the second index.
>>
>> David
>>
>>
>>
>> -----Original Message-----
>> From: Erick Erickson [mailto:[email protected]]
>> Sent: Sunday, November 7, 2010 11:12 AM
>> To: [email protected]
>> Subject: Re: Search returning documents matching a NOT range
>>
>> What kind of field is publish_date? And how do you store data there?
>> Is it possible you're getting some date presentation wonkiness in here?
>> One thing that might shed light on your problem is if you enumerated the
>> terms in that field and printed them out rather than the document.get.
> That is,
>> be sure you're getting what's in the index (and thus being searched)
> rather than
>> wha's stored in the document.
>>
>> Luke might get you there faster/easier....
>>
>> Best
>> Erick
>>
>> On Fri, Nov 5, 2010 at 5:18 PM, David Fertig <[email protected]>
>> wrote:
>>
>> > Ian,
>> > Thank you for getting back to me.  No, I do not get a bogus hit from
>> > searching the small index alone.  Also, I do not get a hit if I delete
>> any
>> > more documents from the larger index.
>> >
>> > I have updated my test to use RamDirectory and also print maxDoc() for
>> the
>> > searchables and the searcher, all numbers are as expected.  I have
>> posted
>> > all the code, but did not want to post the indexes due to their size
>> (2.2
>> > meg uncompressed).  I will mail them to anyone who can help.
>> >
>> > Here is the complete latest test code and its output
>> >
>> >
>> >
>> > public class LuceneTest {
>> >    static public void main(String[] args) {
>> >        try {
>> >            QueryParser queryParser = new
>> QueryParser(Version.LUCENE_30,
>> > "author", new KeywordAnalyzer());
>> >            Query query = queryParser.parse("author:bentalcella AND NOT
>> > publish_date:[20100601 TO 20100630]");
>> >            Searchable[] searchables = new Searchable[2];
>> >             RAMDirectory ram1 = new RAMDirectory(new
>> NIOFSDirectory(new
>> > File("/home/dfertig/testIndexes/b1")));
>> >            RAMDirectory ram2 = new RAMDirectory(new NIOFSDirectory(new
>> > File("/home/dfertig/testIndexes/m1")));
>> >            searchables[0] = new IndexSearcher(ram1, true);
>> >            searchables[1] = new IndexSearcher(ram2, true);
>> >            MultiSearcher searcher = new MultiSearcher(searchables);
>> >            System.out.println("MaxDocs for index 1: " +
>> > searchables[0].maxDoc());
>> >            System.out.println("MaxDocs for index 2: " +
>> > searchables[1].maxDoc());
>> >            System.out.println("MaxDocs for MultiSearcher: " +
>> > searcher.maxDoc());
>> >             System.out.println("Query: " + query.toString());
>> >            TopDocs topDocs = searcher.search(query, 10);
>> >            System.out.println("Results: " + topDocs.totalHits);
>> >            for (int in = 0; in < topDocs.totalHits; in++) {
>> >                Document document =
>> searcher.doc(topDocs.scoreDocs[in].doc);
>> >                System.out.println("publish_date: " +
>> > document.get("publish_date"));
>> >            }
>> >            searcher.close();
>> >             ram1.close();
>> >            ram2.close();
>> >         } catch (Exception e) {
>> >            System.out.println(e.getMessage());
>> >            e.printStackTrace();
>> >        }
>> >    }
>> > }
>> >
>> > Output:
>> > MaxDocs for index 1: 1
>> > MaxDocs for index 2: 1000
>> > MaxDocs for MultiSearcher: 1001
>> > Query: +author:bentalcella -publish_date:[20100601 TO 20100630]
>> > Results: 1
>> > publish_date: 20100606
>> >
>> >
>> >
>> >
>> > -----Original Message-----
>> > From: Ian Lea [mailto:[email protected]]
>> > Sent: Friday, November 5, 2010 4:57 PM
>> > To: [email protected]
>> > Subject: Re: Search returning documents matching a NOT range
>> >
>> > Do you get the bogus hit on the small index if search that index
>> > alone?  Are you positive it only holds the one doc? Loading the one
>> > doc into a new RAM based index in the test would prove it.
>> >
>> > You are more likely to get help if post a self-contained example -
>> > people can see everything relevant and are more likely to spot a
>> > problem.
>> >
>> >
>> > --
>> > Ian.
>> >
>> >
>> > On Thu, Nov 4, 2010 at 4:52 PM, David Fertig <[email protected]>
>> wrote:
>> > > I have an active lucene implementation that has been in place for a
>> > > couple years and was recently upgraded to the 3.02 branch. We are
>> now
>> > > occasionally seeing documents returned from searches that should not
>> be
>> > > returned. I have reduced the code and indexes to the smallest set
>> > > possible where I can still repeat the issue.
>> > >
>> > >
>> > >
>> > > My test cases uses 2 indexes.  These indexes have been
>> rebuilt/optimized
>> > > using Luke 1.0.1 to make them the smallest possible.  One index has
>> 1
>> > > document, which is being returned by the query but should not.   The
>> > > other index has 1000 documents, none of which match the search
>> criteria.
>> > > The query should bring back 0 results, but brings back 1.  I can zip
>> and
>> > > mail the indexes if it would aid in helping track down this issue.
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > public class LuceneTest {
>> > >
>> > >    static public void main(String[] args) {
>> > >
>> > >        try {
>> > >
>> > >            QueryParser queryParser = new
>> QueryParser(Version.LUCENE_30,
>> > > "author", new KeywordAnalyzer());
>> > >
>> > >            Query query = queryParser.parse("author:bentalcella AND
>> NOT
>> > > publish_date:[20100601 TO 20100630]");
>> > >
>> > >            Searchable[] searchables = new Searchable[2];
>> > >
>> > >            searchables[0] = new IndexSearcher(new NIOFSDirectory(new
>> > > File("/home/dfertig/testIndexes/b1")), true);
>> > >
>> > >            searchables[1] = new IndexSearcher(new NIOFSDirectory(new
>> > > File("/home/dfertig/testIndexes/m1")), true);
>> > >
>> > >            Searcher searcher = new MultiSearcher(searchables);
>> > >
>> > >            System.out.println("Query: " + query.toString());
>> > >
>> > >            TopDocs topDocs = searcher.search(query, 10);
>> > >
>> > >            System.out.println("Results: " + topDocs.totalHits);
>> > >
>> > >            for (int in = 0; in < topDocs.totalHits; in++) {
>> > >
>> > >                Document document =
>> > > searcher.doc(topDocs.scoreDocs[in].doc);
>> > >
>> > >                System.out.println("publish_date: " +
>> > > document.get("publish_date"));
>> > >
>> > >            }
>> > >
>> > >            searcher.close();
>> > >
>> > >        } catch (Exception e) {
>> > >
>> > >            System.out.println(e.getMessage());
>> > >
>> > >            e.printStackTrace();
>> > >
>> > >        }
>> > >
>> > >    }
>> > >
>> > > }
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: [email protected]
>> > For additional commands, e-mail: [email protected]
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: [email protected]
>> > For additional commands, e-mail: [email protected]
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: Search returning documents matching a NOT range

Reply via email to