[ https://issues.apache.org/jira/browse/LUCENE-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934214#action_12934214 ]
Trejkaz commented on LUCENE-2348:
---------------------------------

Field collapsing has different semantics, which don't match those of DuplicateFilter. It's useful if you want to collapse two hits down to one hit, but it doesn't work if you are using DuplicateFilter to filter out previous copies of a document (whether you are working around the issue of Lucene shifting doc IDs when deleting, or simply want to keep the history in case you need it later).

In this situation you want all but one filtered out, whether the one that matches the query matches the filter or not. Initially this might not seem like removing duplicates, but it really is, since you're just removing duplicates based on the "id" field.

Similarly, I'm not sure how using a collector would help. There is even a note in HitCollector saying not to look at the document during collection, because doing so will reduce performance by an order of magnitude or more. If you have to look at a field, then you have to look at the document. FieldCache was introduced to try to avoid this, but in practice it doesn't work once you have tens of millions of documents in your index, unless you have an extraordinary amount of RAM allocated to the JVM (and not every application is a server application!).

Even supposing you were willing to take the performance hit, or had a system with enough RAM to store the field cache, the collector only receives the ID of the document that hit. It doesn't provide any of the context you need to see which other documents had the same value in the field.
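
To make the difference in semantics concrete, here is a minimal sketch in plain Java, with no Lucene dependency. The `Doc` class, the in-memory "index", and the method names are all hypothetical and only model one reading of the behaviour described above: a duplicate filter keeps the last copy per "id" before the query runs, while collapsing deduplicates among the hits after the query has run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model only; this is not Lucene's DuplicateFilter or field collapsing code.
final class DuplicateSemantics {

    // Stand-in for an indexed document: an "id" field plus some body text.
    static final class Doc {
        final String id;
        final String text;
        Doc(String id, String text) { this.id = id; this.text = text; }
    }

    // Doc 0 is an old copy of "A"; doc 2 is the current copy and no longer mentions "budget".
    static List<Doc> sampleIndex() {
        return Arrays.asList(
            new Doc("A", "draft budget"),
            new Doc("B", "budget report"),
            new Doc("A", "final plan"));
    }

    // Filter semantics: mark only the last copy of each id, independent of any query.
    static BitSet keepLastPerId(List<Doc> index) {
        Map<String, Integer> last = new HashMap<>();
        for (int docId = 0; docId < index.size(); docId++) {
            last.put(index.get(docId).id, docId); // later copies overwrite earlier ones
        }
        BitSet bits = new BitSet(index.size());
        for (int docId : last.values()) {
            bits.set(docId);
        }
        return bits;
    }

    // Naive term search; a null filter means "no filter".
    static List<Integer> search(List<Doc> index, String term, BitSet filter) {
        List<Integer> hits = new ArrayList<>();
        for (int docId = 0; docId < index.size(); docId++) {
            boolean allowed = filter == null || filter.get(docId);
            if (allowed && index.get(docId).text.contains(term)) {
                hits.add(docId);
            }
        }
        return hits;
    }

    // Collapse semantics: keep one hit per id, but only among documents that already matched.
    static List<Integer> collapse(List<Doc> index, List<Integer> hits) {
        Map<String, Integer> byId = new LinkedHashMap<>();
        for (int docId : hits) {
            byId.put(index.get(docId).id, docId);
        }
        return new ArrayList<>(byId.values());
    }

    public static void main(String[] args) {
        List<Doc> index = sampleIndex();
        System.out.println(search(index, "budget", keepLastPerId(index)));      // prints [1]
        System.out.println(collapse(index, search(index, "budget", null)));     // prints [0, 1]
    }
}
```

In this toy data the collapsed result still contains doc 0, the superseded copy of "A", because the current copy never matched the query and so was never a candidate for collapsing. The filtered result drops it, which is the behaviour wanted when old copies are kept only as history.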
> DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2348
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2348
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>    Affects Versions: 2.9.2
>            Reporter: Trejkaz
>         Attachments: LUCENE-2348.patch, LUCENE-2348.patch
>
>
> DuplicateFilter currently works by building a single doc ID set, without
> taking into account that getDocIdSet() will be called once per segment and
> only with each segment's local reader.