[jira] Commented: (LUCENE-2348) DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers

Karthick Sankarachary (JIRA) Tue, 22 Jun 2010 18:11:17 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881502#action_12881502 ]


Karthick Sankarachary commented on LUCENE-2348:
-----------------------------------------------

{quote}1. If your filterable data is in another store (e.g. a database), then 
you would still need either some way to get to the top level reader or a way to 
know what its offset is, but there is no way to get that information from the 
reader which was passed in.{quote}

In theory, one could obtain the top-level reader from a segment reader as 
follows: IndexReader.open(((SegmentReader) reader).directory()), where reader 
is the one passed to the filter. Of course, a top-level reader obtained this 
way might be slightly "ahead" of the segment reader's actual parent, because 
it is opened later and so may reflect newer commits. If you think it makes 
sense, I can add a convenience method to StatefulFilter that obtains the 
top-level reader using this approach. 
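
For the other option in the quoted point (knowing the segment's offset), a top-level doc ID is the segment-local doc ID plus the sum of maxDoc() over the preceding segments. The sketch below only illustrates that arithmetic. SegmentOffsets and its methods are hypothetical names, not part of Lucene or of the attached patch, and the segment sizes are assumed to be known already, which is exactly the information the filter does not receive today.

```java
// Hypothetical helper (not a Lucene class): maps segment-local doc IDs
// to top-level doc IDs, given each segment's maxDoc in index order.
final class SegmentOffsets {
    // starts[i] is the doc base of segment i; starts[n] is the total doc count.
    private final int[] starts;

    SegmentOffsets(int[] segmentMaxDocs) {
        starts = new int[segmentMaxDocs.length + 1];
        for (int i = 0; i < segmentMaxDocs.length; i++) {
            starts[i + 1] = starts[i] + segmentMaxDocs[i];
        }
    }

    // Offset of the given segment within the top-level doc ID space.
    int docBase(int segment) {
        return starts[segment];
    }

    // Converts a local doc ID in the given segment to a top-level doc ID.
    int toTopLevel(int segment, int localDoc) {
        int size = starts[segment + 1] - starts[segment];
        if (localDoc < 0 || localDoc >= size) {
            throw new IllegalArgumentException("localDoc out of range: " + localDoc);
        }
        return starts[segment] + localDoc;
    }
}
```

With segments of 5, 3 and 4 documents, the third segment starts at offset 8, so an external store keyed by top-level doc ID would need that base added to every local ID.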

{quote}2. If you want to return the newest item instead of the oldest item, it 
will be too late if getStatefulDocIdSet for an earlier call has already 
returned the older one.{quote}

Actually, if you create a DuplicateFilter with keepMode set to 
KM_USE_FIRST_OCCURRENCE, it returns the document from the first matching 
segment and ignores the duplicates in subsequent segments (thanks to its 
stateful behavior). However, the current approach breaks when keepMode is set 
to KM_USE_LAST_OCCURRENCE. Again, in theory, if we could determine whether the 
reader corresponds to the last segment, we could defer all matches until 
after the last reader has been processed. I'm open to any other suggestions 
you might have for addressing that case.
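
To make the difference between the two keep modes concrete, here is a minimal sketch with no Lucene dependencies. StatefulDedupSketch and its methods are hypothetical and are not the code in the attached patch. Each segment is modelled as an array of duplicate-field values indexed by local doc ID. It shows why first-occurrence matches can be emitted per segment, while last-occurrence matches have to wait until every segment has been seen.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a stateful duplicate filter; not a Lucene class.
final class StatefulDedupSketch {
    // Keys already matched in earlier segments (first-occurrence mode).
    private final Set<String> seen = new HashSet<>();
    // Most recent {segment, localDoc} per key (last-occurrence mode).
    private final Map<String, int[]> latest = new LinkedHashMap<>();

    // First-occurrence mode: the local doc IDs to keep can be returned
    // immediately, because earlier segments have already claimed their keys.
    List<Integer> keepFirst(String[] keysByLocalDoc) {
        List<Integer> kept = new ArrayList<>();
        for (int doc = 0; doc < keysByLocalDoc.length; doc++) {
            if (seen.add(keysByLocalDoc[doc])) {
                kept.add(doc);
            }
        }
        return kept;
    }

    // Last-occurrence mode: a later segment may still override this match,
    // so the call only records candidates and emits nothing.
    void recordForLast(int segment, String[] keysByLocalDoc) {
        for (int doc = 0; doc < keysByLocalDoc.length; doc++) {
            latest.put(keysByLocalDoc[doc], new int[] {segment, doc});
        }
    }

    // Must be called only after the last segment has been recorded.
    // Each entry is a {segment, localDoc} pair.
    List<int[]> resolveLast() {
        return new ArrayList<>(latest.values());
    }
}
```

The deferred mode is only usable if something tells the filter that the last segment has been processed, which is the missing piece described above.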

> DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment 
> readers
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2348
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2348
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>    Affects Versions: 2.9.2
>            Reporter: Trejkaz
>         Attachments: LUCENE-2348.patch
>
>
> DuplicateFilter currently works by building a single doc ID set, without 
> taking into account that getDocIdSet() will be called once per segment and 
> only with each segment's local reader.

