Hello all,
I am looking for a strategy to exclude duplicate entries when searching multiple indexes which may contain the same document.

I have an email system which archives and indexes emails on a per-recipient basis, so each email recipient has their own index. When the same email is delivered to more than one recipient, each recipient's index stores a record of what is effectively the same document.

There is now a requirement to search across multiple indexes, for which I am using the ParallelMultiSearcher. The problem is that this produces duplicate entries in the Hits returned. I can easily transfer the results into some form of java.util.Set to guarantee uniqueness, but then the length() of the returned Hits object no longer matches the number of unique results. Ideally I need a way of filtering the Hits based on a "no duplicate" rule.

I am aware of the Filter object. However, the unique identifier of my document is a field within the Lucene document itself (messageid), and I am reluctant to read this field through the public API for every Hit, as I fear that loading each document will have drastic performance implications.
The ideal solution for me would be to specify, at search time, a field whose value is guaranteed to be unique across the Hits returned. Does anyone know of an elegant way to do this? Alternatively, is there a way I can de-dupe the list myself without loading every document?
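For reference, the Set-based de-duplication I mentioned would look roughly like the sketch below. It is plain Java with no Lucene calls: each String[] is a hypothetical stand-in for one hit (messageid first, then the recipient whose index it came from), and the class and method names are illustrative only. It keeps the first, highest-ranked copy of each message, but it still needs the messageid of every hit, which is exactly the cost I would like to avoid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class DedupeSketch {

    /**
     * Keeps the first element seen for each key and preserves input order,
     * so the highest-ranked copy of a duplicated message survives.
     */
    static <T, K> List<T> dedupe(List<T> ranked, Function<T, K> keyOf) {
        Set<K> seen = new HashSet<>();
        List<T> unique = new ArrayList<>();
        for (T item : ranked) {
            // Set.add returns false when the key was already present.
            if (seen.add(keyOf.apply(item))) {
                unique.add(item);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        // Hypothetical hits in rank order: {messageid, recipient index}.
        List<String[]> hits = Arrays.asList(
            new String[] {"<a@example.com>", "alice"},
            new String[] {"<a@example.com>", "bob"},
            new String[] {"<b@example.com>", "bob"});

        List<String[]> unique = dedupe(hits, h -> h[0]);
        for (String[] h : unique) {
            System.out.println(h[0] + " from index " + h[1]);
        }
    }
}
```

The number of unique results is only known once every hit has been walked, which is why the length() of the original Hits is of no use to me here.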
Apologies for the length of this question.
P.S. The separation of indexes per-recipient is a mandatory requirement.
