Thank you very much for helping!

There isn't much I can add about my use case. I have user-generated video
titles, plus hash codes that tell me when two entries are the same video.
Users search videos by title, and I should return the top 1000 unique
videos to them.
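
For reference, this is roughly how I index each video so the dedup key is
available as a doc value (a minimal sketch only; the field names "title" and
"videoHash" are placeholders for my real schema):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.SortedDocValuesField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.util.BytesRef;

    import java.io.IOException;

    public final class VideoIndexer {

        // Adds one video; duplicates of the same video share the same hash.
        public static void addVideo(IndexWriter writer, String title, String hash)
                throws IOException {
            Document doc = new Document();
            // Users search by title, so index it as analyzed text.
            doc.add(new TextField("title", title, Field.Store.YES));
            // Grouping/collapsing needs the dedup key as doc values.
            doc.add(new SortedDocValuesField("videoHash", new BytesRef(hash)));
            // Also keep it indexed and stored for retrieval and filtering.
            doc.add(new StringField("videoHash", hash, Field.Store.YES));
            writer.addDocument(doc);
        }
    }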

I will try to use grouping without counting groups. Otherwise I'll look
here https://issues.apache.org/jira/browse/SOLR-11831 or here
https://lucene.apache.org/solr/guide/8_6/collapse-and-expand-results.html
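
Concretely, what I plan to try first is something along these lines (again
just a sketch with the placeholder field names from above, and "sort" standing
in for my real multi-field sort):

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.grouping.GroupingSearch;
    import org.apache.lucene.search.grouping.TopGroups;
    import org.apache.lucene.util.BytesRef;

    import java.io.IOException;

    public final class DedupSearch {

        // Returns the top n unique videos: one document per videoHash group.
        public static TopGroups<BytesRef> topUniqueVideos(IndexSearcher searcher,
                Query query, Sort sort, int n) throws IOException {
            GroupingSearch grouping = new GroupingSearch("videoHash");
            grouping.setGroupSort(sort);        // custom multi-field sort between groups
            grouping.setSortWithinGroup(sort);  // pick the best document inside each group
            grouping.setGroupDocsLimit(1);      // keep one representative per video
            grouping.setAllGroups(false);       // skip the expensive totalGroupCount
            return grouping.search(searcher, query, 0, n);
        }
    }

If I end up doing this on the Solr side instead, my understanding of the
collapse documentation linked above is that the equivalent would be a filter
like fq={!collapse field=videoHash} rather than the grouping API.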

Thanks again!

Fri, Oct 9, 2020 at 18:57, Jigar Shah <jigaronl...@gmail.com>:

> My learnings from dealing with this problem:
>
> We faced a similar problem before, and did the following things:
>
> 1) Don't request totalGroupCount; without it the response was fast, since
> computing the group count is an expensive task. This helps if you can live
> without groupCount. You can still approximate pagination: page up to the
> total document count (the group count will always be less), and stop as
> soon as a page comes back empty.
> 2) Have more shards, so you can get the best out of parallel execution.
>
> I have seen use cases with 60M total documents, deduplicated on a doc
> values field, with 4 shards.
>
> Query time SLA is around 5-6 seconds. Not unbearable for users.
>
> Let me know if you find a better solution.
>
>
> On Fri, Oct 9, 2020 at 11:45 AM Diego Ceccarelli (BLOOMBERG/ LONDON) <
> dceccarel...@bloomberg.net> wrote:
>
> > As Erick said, can you tell us a bit more about the use case?
> > There might be another way to achieve the same result.
> >
> > What are these documents?
> > Why do you need 1000 docs per user?
> >
> >
> > From: java-user@lucene.apache.org At: 10/09/20 14:25:02 To:
> > java-user@lucene.apache.org
> > Subject: Re: Deduplication of search result with custom sort
> >
> > 6_500_000 is the total count of groups in the entire collection. I only
> > return the top 1000 to users.
> > I use Lucene, where I have documents that can share the same docvalue, and
> > I want to deduplicate these documents by that docvalue during search.
> > Also, I sort my documents by multiple fields, and because of this I can't
> > use DiversifiedTopDocsCollector, which works with the relevance score only.
> >
> > Fri, Oct 9, 2020 at 16:02, Erick Erickson <erickerick...@gmail.com>:
> >
> > > This is going to be fairly painful. You need to keep a list 6.5M
> > > items long, sorted.
> > >
> > > Before diving in there, I'd really back up and ask what the use case
> > > is. Returning 6.5M docs to a user is useless, so maybe you're doing
> > > some kind of analytics? In which case, and again
> > > assuming you're using Solr, Streaming Aggregation might
> > > be a better option.
> > >
> > > This really sounds like an XY problem. You’re trying to solve problem X
> > > and asking how to accomplish it with Y. What I’m questioning
> > > is whether Y (grouping) is a good approach or not. Perhaps if
> > > you explained X there’d be a better suggestion.
> > >
> > > Best,
> > > Erick
> > >
> > > > On Oct 9, 2020, at 8:19 AM, Dmitry Emets <emet...@gmail.com> wrote:
> > > >
> > > > I have 12_000_000 documents, 6_500_000 groups
> > > >
> > > > With sort: It takes around 1 sec without grouping, 2 sec with grouping,
> > > > and 12 sec with setAllGroups(true).
> > > > Without sort: It takes around 0.2 sec without grouping, 0.6 sec with
> > > > grouping, and 10 sec with setAllGroups(true).
> > > >
> > > > Thank you, Erick, I will look into it
> > > >
> > > > Fri, Oct 9, 2020 at 14:32, Erick Erickson <erickerick...@gmail.com>:
> > > >
> > > >> At the Solr level, see CollapsingQParserPlugin:
> > > >>
> > > >> https://lucene.apache.org/solr/guide/8_6/collapse-and-expand-results.html
> > > >>
> > > >> You could perhaps steal some ideas from that if you
> > > >> need this at the Lucene level.
> > > >>
> > > >> Best,
> > > >> Erick
> > > >>
> > > >>> On Oct 9, 2020, at 7:25 AM, Diego Ceccarelli (BLOOMBERG/ LONDON) <
> > > >> dceccarel...@bloomberg.net> wrote:
> > > >>>
> > > >>> Is the field that you are using to dedupe stored as a docvalue?
> > > >>>
> > > >>> From: java-user@lucene.apache.org At: 10/09/20 12:18:04 To:
> > > >>> java-user@lucene.apache.org
> > > >>> Subject: Deduplication of search result with custom sort
> > > >>>
> > > >>> Hi,
> > > >>> I need to deduplicate search results by a specific field, and I have
> > > >>> no idea how to implement this properly.
> > > >>> I have tried grouping with setGroupDocsLimit(1), and it gives me the
> > > >>> expected results, but the performance is not very good.
> > > >>> I think that I need something like DiversifiedTopDocsCollector, but
> > > >>> suitable for collecting TopFieldDocs.
> > > >>> Is there any possibility to achieve deduplication with existing
> > > >>> Lucene components, or do I need to implement my own
> > > >>> DiversifiedTopFieldsCollector?
