Re: removing duplicate Documents from Hits

Doug Cutting Fri, 01 Oct 2004 12:24:24 -0700

Timm, Andy (ETW) wrote:

Hello, I've searched previous posts on this topic but couldn't find an answer. I want to query my index (which is built from a number of 'flattened' Oracle tables) for some criteria, then return Hits such that no two Documents share the same value for a particular field. In the case where table A has a one-to-many relationship to table B, I get one Document for each pair (A1-B1, A1-B2, A1-B3...). My index needs to have each of these records, as 'B' is a searchable field in the index. However, after the query is executed, I want my resulting Hits to be unique on 'A'. I'm only returning the Oracle object ID, so once I've seen it once I don't need it again. It looks like some sort of custom Filter is in order.

I'd suggest a HitCollector that uses a FieldCache of the "A" values to check for duplicates, and collects only the best document id for each value of "A". This would use a bit of RAM, but be very fast.

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/HitCollector.html
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/FieldCache.html
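
The following is an illustrative sketch only and was not part of the original message. It shows the deduplication logic in plain Java without depending on Lucene. The class name DedupCollector and the method bestDocs() are hypothetical. The String[] passed to the constructor stands in for the per-document array of "A" values that a FieldCache lookup would supply, and collect(int doc, float score) mirrors the shape of HitCollector.collect. It assumes every document has a value for "A" and that "best" means highest score.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: keeps only the highest-scoring doc id per "A" value.
final class DedupCollector {
    // fieldValues[doc] is the "A" value of document number doc
    // (assumed to come from a per-document field cache).
    private final String[] fieldValues;
    private final Map<String, Integer> bestDoc = new LinkedHashMap<>();
    private final Map<String, Float> bestScore = new LinkedHashMap<>();

    DedupCollector(String[] fieldValues) {
        this.fieldValues = fieldValues;
    }

    // Called once per matching document, in any order.
    void collect(int doc, float score) {
        String key = fieldValues[doc];
        Float previous = bestScore.get(key);
        // Keep this doc if its "A" value is new or it beats the earlier score.
        if (previous == null || score > previous) {
            bestScore.put(key, score);
            bestDoc.put(key, doc);
        }
    }

    // Maps each distinct "A" value to the id of its best-scoring document.
    Map<String, Integer> bestDocs() {
        return bestDoc;
    }
}
```

In a real search, a HitCollector subclass would forward each collect(doc, score) call to logic like this, and the caller would read the surviving document ids from bestDocs() once the search returns. The per-key maps grow with the number of distinct "A" values among the matches, which is in addition to the RAM used by the cached field array.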

Doug

