[ 
https://issues.apache.org/jira/browse/LUCENE-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874529#action_12874529
 ] 

Michael McCandless commented on LUCENE-2348:
--------------------------------------------

bq. What you describe is precisely the problem. It will deduplicate only over 
each segment, not over the text index as one would expect given the name of the 
class.

Duh, right!  You want dedup to apply to the entire index....

Ugh, so this has been broken since the cutover to per-segment searching (2.9.x).

This is tricky to fix.  Somehow DuplicateFilter needs to get ahold of the top 
reader.  It must then run its dup detection against the TermEnum from that top 
reader, and when asked for each sub-reader, return the slice of the top 
reader's bits that covers that sub-reader's docs.
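
To make the "dedup once at the top, then slice per sub-reader" idea concrete, 
here is a rough sketch in plain Java.  It does not use any Lucene classes: each 
segment is modeled as an array of key values (one per doc), and the names 
TopLevelDedupSketch, dedupTopLevel and sliceForSegment are made up for 
illustration.  It also assumes a "keep the first occurrence" policy, which is 
only one of the modes a real filter might want.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the real filter: segments are arrays of key values.
class TopLevelDedupSketch {

    /** Sets a bit for the first doc carrying each key, using top-level doc IDs. */
    static BitSet dedupTopLevel(String[][] segments) {
        int maxDoc = 0;
        for (String[] seg : segments) {
            maxDoc += seg.length;
        }
        BitSet bits = new BitSet(maxDoc);
        Set<String> seen = new HashSet<>();
        int docId = 0;
        for (String[] seg : segments) {
            for (String key : seg) {
                if (seen.add(key)) {   // add() returns true only for a new key
                    bits.set(docId);
                }
                docId++;
            }
        }
        return bits;
    }

    /** Returns one segment's part of the top-level bits, rebased to local doc IDs. */
    static BitSet sliceForSegment(BitSet topBits, int docBase, int segmentMaxDoc) {
        return topBits.get(docBase, docBase + segmentMaxDoc);
    }

    public static void main(String[] args) {
        String[][] segments = { {"a", "b"}, {"b", "c", "a"} };
        BitSet top = dedupTopLevel(segments);
        int docBase = 0;
        for (String[] seg : segments) {
            System.out.println(sliceForSegment(top, docBase, seg.length));
            docBase += seg.length;
        }
        // prints {0, 1} then {1}: "b" and "a" in the second segment are dropped
    }
}
```

Deduplicating each segment on its own would instead keep all three docs of the 
second segment, which is the behavior reported in this issue.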

There's no way, now, given a sub-reader to figure out which parent reader it 
belongs to... so I think we'd have to change DuplicateFilter to take in the top 
reader to its ctor?  (But this is sort of messy -- no other core/contrib 
filters have this "state" -- they are normally free to be reused across 
readers).

The only other [big] change I can think of is if we could change the Filter API 
to be more like Scorer, which does first receive the top reader (since it needs 
to init measures like idf across all segments), and then separately steps 
through each sub-reader.
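
Roughly, such a two-phase API could have the shape below.  This is only a 
sketch and not a proposal for actual signatures: TwoPhaseFilterSketch, 
PerSegmentBits and DedupTwoPhaseFilter are hypothetical names, and the index 
is again modeled as plain arrays of key values rather than real readers.  The 
point is that the cross-segment state lives in the object returned by the 
first phase, so the filter itself stays stateless and reusable across readers.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical two-phase filter shape; not part of Lucene's Filter API. */
interface TwoPhaseFilterSketch {

    /** Phase 1: called once with a view of the whole (top-level) index. */
    PerSegmentBits prepare(String[][] allSegments);

    /** Phase 2: called once per segment, returning segment-local bits. */
    interface PerSegmentBits {
        BitSet bitsFor(int segmentIndex);
    }
}

/** Keeps the first doc seen for each key across all segments. */
class DedupTwoPhaseFilter implements TwoPhaseFilterSketch {

    @Override
    public TwoPhaseFilterSketch.PerSegmentBits prepare(String[][] allSegments) {
        Set<String> seen = new HashSet<>();   // shared across segments
        BitSet[] perSegment = new BitSet[allSegments.length];
        for (int s = 0; s < allSegments.length; s++) {
            perSegment[s] = new BitSet(allSegments[s].length);
            for (int doc = 0; doc < allSegments[s].length; doc++) {
                if (seen.add(allSegments[s][doc])) {
                    perSegment[s].set(doc);
                }
            }
        }
        // The per-index state is captured here, not stored on the filter.
        return segmentIndex -> perSegment[segmentIndex];
    }
}
```

A caller would invoke prepare() once per search against the top-level view and 
then ask the returned object for bits segment by segment.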

> DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment 
> readers
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2348
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2348
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>    Affects Versions: 2.9.2
>            Reporter: Trejkaz
>
> DuplicateFilter currently works by building a single doc ID set, without 
> taking into account that getDocIdSet() will be called once per segment and 
> only with each segment's local reader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


