This is possible using Lucene contrib's DuplicateFilter.
Below is your problem expressed as an XML-based test, which I just ran
successfully through my test writer/runner.
Hopefully it is readable and demonstrates the use of FilteredQuery with
DuplicateFilter.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="test.xsl"?>
<Test description="DuplicateFilter tests">
  <Data>
    <Index name="index1">
      <Analyzers class="org.apache.lucene.analysis.standard.StandardAnalyzer29"> </Analyzers>
      <Shard name="shard1">
        <Document pk="1">
          <Field name="text">This is my test</Field>
          <Field name="md5">abc</Field>
        </Document>
        <Document pk="2">
          <Field name="text">This is my test</Field>
          <Field name="md5">abc</Field>
        </Document>
        <Document pk="3">
          <Field name="text">Another test</Field>
          <Field name="md5">def</Field>
        </Document>
        <Document pk="4">
          <Field name="text">This is my test</Field>
          <Field name="md5">abc</Field>
        </Document>
      </Shard>
    </Index>
  </Data>
  <Tests>
    <Test description="Eliminate duplicates based on MD5 field">
      <Query>
        <FilteredQuery>
          <Query>
            <UserQuery fieldName="text">test</UserQuery>
          </Query>
          <Filter>
            <DuplicateFilter fieldName="md5"/>
          </Filter>
        </FilteredQuery>
      </Query>
      <ExpectedResults>
        <Result fieldName="pk">1</Result>
        <Result fieldName="pk">3</Result>
      </ExpectedResults>
    </Test>
  </Tests>
</Test>
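
For anyone who wants to see the expected outcome without the test runner, here
is a minimal plain-Java sketch of the "keep the first document per md5 value"
behaviour that the test above checks. It does not use Lucene at all, and it is
not how DuplicateFilter is implemented; the DedupSketch class, the Doc holder
and the keepFirstPerKey method are names I made up for illustration. It also
assumes every document matches the query, which is true for the four documents
in the test.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupSketch {

    /** Hypothetical stand-in for an indexed document: primary key plus md5 signature. */
    static final class Doc {
        final String pk;
        final String md5;

        Doc(String pk, String md5) {
            this.pk = pk;
            this.md5 = md5;
        }
    }

    /** Returns the pk of the first document seen for each distinct md5, in input order. */
    static List<String> keepFirstPerKey(List<Doc> docs) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (Doc d : docs) {
            // Set.add returns false when the signature was already present,
            // so later documents with the same md5 are skipped.
            if (seen.add(d.md5)) {
                kept.add(d.pk);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Same four documents as in the XML test data.
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("1", "abc"));
        docs.add(new Doc("2", "abc"));
        docs.add(new Doc("3", "def"));
        docs.add(new Doc("4", "abc"));

        System.out.println(keepFirstPerKey(docs)); // prints [1, 3]
    }
}
```

Documents 2 and 4 stay in the list (and would stay in the index); they are only
left out of the result, which is the behaviour asked for in the thread.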
----- Original Message ----
From: Mark <[email protected]>
To: [email protected]
Sent: Thu, 10 March, 2011 15:35:22
Subject: Re: Detecting duplicates
My understanding is that it can mark documents with the same signature,
indicating that they are similar; however, there is no way at query time
to return only 1 "unique" document per signature. Am I missing something?
Doc 1) This is my test
Doc 2) This is my test
Doc 3) Another test
Doc 4) This is my test
If I run a query for "test" it should return
Doc 1) This is my test
Doc 3) Another test
On 3/10/11 6:25 AM, Grant Ingersoll wrote:
> On Mar 5, 2011, at 8:35 PM, Mark wrote:
>
>> I'm familiar with Deduplication however I do not wish to remove my
>> duplicates and my needs are slightly different. I would like to mark the
>> first document with signature 'xyz' as unique but the next one as a
>> duplicate. This way I can filter out "duplicates" during searching using a
>> filter query but still return the original document.
> My understanding is that you can have it mark duplicates.
>
>> The only thing I know of at the moment is to use field collapsing but I
>> tried the patch on 1.4.1 and it was terribly slow.
>>
>> On 3/5/11 4:43 AM, Grant Ingersoll wrote:
>>> See http://wiki.apache.org/solr/Deduplication. Should be fairly easy to
>>> pull out if you are doing just Lucene.
>>>
>>> On Mar 5, 2011, at 1:49 AM, Mark wrote:
>>>
>>>> Is there a way one could detect duplicates (say by using some unique hash
>>>> of certain fields) and mark a document as a duplicate but not remove it?
>>>>
>>>> Here is an example:
>>>>
>>>> Doc 1) This is my test
>>>> Doc 2) This is my test
>>>> Doc 3) Another test
>>>> Doc 4) This is my test
>>>>
>>>> Doc 1 and 3 should be considered unique whereas 2 and 4 should be marked
>>>> as duplicates (of doc 1).
>>>>
>>>> Can this be easily accomplished?
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem docs using Solr/Lucene:
>>> http://www.lucidimagination.com/search
>>>