Re: Deduplication of search result with custom with custom sort

Erick Erickson Fri, 09 Oct 2020 06:02:35 -0700

This is going to be fairly painful. You need to keep a list 6.5M
items long, sorted.
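
To make the cost concrete, here is a minimal plain-Java sketch of what
"one entry per group, sorted" amounts to. It is not Lucene or Solr API:
the Hit class and the dedupSorted method are hypothetical names made up
for illustration, and a single long sort value is assumed. The point is
only that the map holds one entry per distinct key (about 6.5M here)
before anything can be sorted.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DedupSketch {
    /** Hypothetical stand-in for a search hit: doc id, dedup key, sort value. */
    static final class Hit {
        final int docId;
        final String groupKey;
        final long sortValue;

        Hit(int docId, String groupKey, long sortValue) {
            this.docId = docId;
            this.groupKey = groupKey;
            this.sortValue = sortValue;
        }
    }

    /**
     * Keeps the hit with the smallest sortValue for each groupKey,
     * then sorts the survivors by sortValue (ties broken by docId).
     * Memory is proportional to the number of distinct keys.
     */
    static List<Hit> dedupSorted(Iterable<Hit> hits) {
        Map<String, Hit> bestPerGroup = new HashMap<>();
        for (Hit h : hits) {
            // Keep whichever hit sorts first for this key; on a tie keep the earlier one.
            bestPerGroup.merge(h.groupKey, h,
                (kept, candidate) -> candidate.sortValue < kept.sortValue ? candidate : kept);
        }
        List<Hit> result = new ArrayList<>(bestPerGroup.values());
        result.sort(Comparator.comparingLong((Hit h) -> h.sortValue)
                              .thenComparingInt(h -> h.docId));
        return result;
    }
}
```

If only the first page of results is needed, the final full sort could be
replaced by a bounded priority queue, but the per-key map is still
required to know which document wins within each group.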


Before diving in there, I’d really back up and ask what the use-case
is. Returning 6.5M docs to a user is useless, so are you doing
some kind of analytics, maybe? In which case, and again
assuming you’re using Solr, Streaming Aggregation might
be a better option.

This really sounds like an XY problem. You’re trying to solve problem X
and asking how to accomplish it with Y. What I’m questioning
is whether Y (grouping) is a good approach or not. Perhaps if
you explained X there’d be a better suggestion.

Best,
Erick

> On Oct 9, 2020, at 8:19 AM, Dmitry Emets <[email protected]> wrote:
> 
> I have 12_000_000 documents, 6_500_000 groups
> 
> With sort: It takes around 1 sec without grouping, 2 sec with grouping and
> 12 sec with setAllGroups(true)
> Without sort: It takes around 0.2 sec without grouping, 0.6 sec with
> grouping and 10 sec with setAllGroups(true)
> 
> Thank you, Erick, I will look into it
> 
> On Fri, Oct 9, 2020 at 14:32, Erick Erickson <[email protected]> wrote:
> 
>> At the Solr level, see CollapsingQParserPlugin:
>> https://lucene.apache.org/solr/guide/8_6/collapse-and-expand-results.html
>> 
>> You could perhaps steal some ideas from that if you
>> need this at the Lucene level.
>> 
>> Best,
>> Erick
>> 
>>> On Oct 9, 2020, at 7:25 AM, Diego Ceccarelli (BLOOMBERG/ LONDON) <
>> [email protected]> wrote:
>>> 
>>> Is the field that you are using to dedupe stored as a docvalue?
>>> 
>>> From: [email protected] At: 10/09/20 12:18:04 To:
>> [email protected]
>>> Subject: Deduplication of search result with custom with custom sort
>>> 
>>> Hi,
>>> I need to deduplicate search results by specific field and I have no idea
>>> how to implement this properly.
>>> I have tried grouping with setGroupDocsLimit(1) and it gives me the expected
>>> results, but the performance is not very good.
>>> I think that I need something like DiversifiedTopDocsCollector, but
>>> suitable for collecting TopFieldDocs.
>>> Is there any possibility to achieve deduplication with existing lucene
>>> components, or do I need to implement my own
>> DiversifiedTopFieldsCollector?
>>> 
>>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>> 
>> 


