[ https://issues.apache.org/jira/browse/LUCENE-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935765#action_12935765 ]
Trejkaz commented on LUCENE-2348:
---------------------------------

That is exactly the workaround we performed for our own filters, including our private copy of a filter which works like DuplicateFilter. All the ones which need the context now take the reader up-front.

The problem now is that we have to use a different filter instance on each reader. Previously we were caching them globally, and somewhere in the system we are evidently still caching them globally, because one time in a million we find the wrong filter being used on the wrong reader. I am now thinking of making another kind of context-sensitive filter which can somehow omnisciently know about all readers open in the entire JVM (e.g. we hook the place where we open the top-level reader and push the information about its structure into some global watch).

I think Robert's comments possibly stem from the misconception that DuplicateFilter somehow works like field collapsing. I wrote a test just to illustrate how it actually behaves, and to make sure I wasn't confused myself (since he seemed to think I was...):
{code}
public class TestDuplicateFilter {
    IndexReader reader;
    IndexSearcher searcher;

    @Before
    public void setUpSampleData() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true,
                                             IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc;

        doc = new Document();
        doc.add(new Field("id", "1", Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("text", "a", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);

        doc = new Document();
        doc.add(new Field("id", "1", Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("text", "b", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);

        doc = new Document();
        doc.add(new Field("id", "2", Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("text", "c", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);

        writer.close();

        reader = IndexReader.open(dir, true);
        searcher = new IndexSearcher(reader);
    }

    @Test
    public void testHitOnOriginal() throws Exception {
        Filter filter = new DuplicateFilter("id", DuplicateFilter.KM_USE_FIRST_OCCURRENCE,
                                            DuplicateFilter.PM_FULL_VALIDATION);
        TopDocs docs = searcher.search(new TermQuery(new Term("text", "a")), filter, 3);
        assertEquals("Expected one hit - matched the original", 1, docs.totalHits);
        assertEquals("Wrong doc hit", 0, docs.scoreDocs[0].doc);
    }

    @Test
    public void testHitOnCopy() throws Exception {
        Filter filter = new DuplicateFilter("id", DuplicateFilter.KM_USE_FIRST_OCCURRENCE,
                                            DuplicateFilter.PM_FULL_VALIDATION);
        TopDocs docs = searcher.search(new TermQuery(new Term("text", "b")), filter, 3);
        // Field collapsing would return one hit here, which would be undesirable:
        assertEquals("Expected no hits - matched the copy", 0, docs.totalHits);
    }
}
{code}

> DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2348
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2348
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>    Affects Versions: 2.9.2
>            Reporter: Trejkaz
>         Attachments: LUCENE-2348.patch, LUCENE-2348.patch
>
> DuplicateFilter currently works by building a single doc ID set, without
> taking into account that getDocIdSet() will be called once per segment and
> only with each segment's local reader.
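The per-segment problem described above can be modelled in plain Java, independent of Lucene (all names below are hypothetical stand-ins, not the real API): a filter that keeps only the first occurrence of each key, but is called once per segment with only that segment's documents, cannot drop a duplicate whose first occurrence lives in an earlier segment.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the Lucene 2.9 situation: getDocIdSet() is invoked once
// per segment, and doc IDs are local to that segment. The first-occurrence
// logic can only consider the documents it is handed, i.e. one segment's worth.
class PerSegmentDuplicateFilter {
    List<Integer> getDocIdSet(String[] segmentKeyValues) {
        List<Integer> keep = new ArrayList<Integer>();
        Set<String> seen = new HashSet<String>();   // starts empty for every segment
        for (int doc = 0; doc < segmentKeyValues.length; doc++) {
            if (seen.add(segmentKeyValues[doc])) {
                keep.add(doc);                      // first occurrence *in this segment*
            }
        }
        return keep;
    }
}
```

Running this over two segments whose key values are {"1", "2"} and {"1", "3"} keeps doc 0 of the second segment even though its key duplicates doc 0 of the first segment, which is exactly the cross-segment failure the issue describes.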
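The bookkeeping problem raised in the comment (a distinct filter instance per reader, without a global cache occasionally handing the wrong filter to the wrong reader) can be sketched as a cache keyed on reader identity with an explicit eviction hook at reader close. This is a minimal plain-Java sketch with hypothetical stand-in types, not the actual Lucene API:

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical stand-ins: Reader plays the role of IndexReader, and
// ReaderBoundFilter the role of a context-sensitive Filter that was built
// against one specific reader and is only valid for that reader.
class Reader {}

class ReaderBoundFilter {
    final Reader boundTo;
    ReaderBoundFilter(Reader reader) { this.boundTo = reader; }
}

// One filter instance per reader, keyed on reader *identity* so two distinct
// readers can never share a filter even if they compare equal. Entries must be
// evicted explicitly when a reader is closed -- e.g. from the same hook that
// opened the top-level reader -- or the cache leaks closed readers.
class PerReaderFilterCache {
    private final Map<Reader, ReaderBoundFilter> cache =
        new IdentityHashMap<Reader, ReaderBoundFilter>();

    synchronized ReaderBoundFilter filterFor(Reader reader) {
        ReaderBoundFilter filter = cache.get(reader);
        if (filter == null) {
            filter = new ReaderBoundFilter(reader);  // built against this reader only
            cache.put(reader, filter);
        }
        return filter;
    }

    synchronized void onReaderClosed(Reader reader) {
        cache.remove(reader);
    }
}
```

IdentityHashMap rather than a plain HashMap makes the "wrong filter on the wrong reader" failure mode impossible at the cache level: lookups compare references, never equality.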