Re: Deduplication of search result with custom with custom sort

Dmitry Emets Tue, 13 Oct 2020 02:41:40 -0700

I studied the Las Vegas patch and came away with one simple idea.
FirstPassGroupingCollector collects CollectedSearchGroup instances internally.
Each CollectedSearchGroup carries a docId and sortValues, which is exactly what I
need. Thanks for the help!
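
The idea above (a single pass that keeps, for each group, the best document and its sort values) can be sketched roughly as follows. This is a simplified illustration in plain Python, not Lucene's implementation: `top_unique` and the `(doc_id, group_key, sort_values)` tuple layout are hypothetical names chosen for the example, and sort values are assumed to be tuples where smaller means better (negate a numeric field to sort it descending).

```python
from typing import Dict, Hashable, Iterable, List, Tuple

# Hypothetical document layout for this sketch: (doc_id, group_key, sort_values)
Doc = Tuple[int, Hashable, tuple]


def top_unique(docs: Iterable[Doc], n: int) -> List[Doc]:
    """Return the best document of each of the top-n groups, in sort order."""
    if n <= 0:
        return []
    # group_key -> (sort_values, doc_id) of the best document seen so far
    best: Dict[Hashable, Tuple[tuple, int]] = {}
    for doc_id, group, sort_values in docs:
        candidate = (sort_values, doc_id)  # doc_id breaks ties deterministically
        current = best.get(group)
        if current is not None:
            # Known group: keep whichever document sorts first.
            if candidate < current:
                best[group] = candidate
        elif len(best) < n:
            # Still room for another group.
            best[group] = candidate
        else:
            # Full: replace the worst group only if the new document beats it.
            worst_group = max(best, key=best.get)
            if candidate < best[worst_group]:
                del best[worst_group]
                best[group] = candidate
    ranked = sorted(best.items(), key=lambda item: item[1])
    return [(doc_id, group, values) for group, (values, doc_id) in ranked]


docs = [
    (0, "h1", (-5.0, "b")),
    (1, "h2", (-7.0, "a")),
    (2, "h1", (-9.0, "c")),
    (3, "h3", (-1.0, "z")),
    (4, "h2", (-7.0, "a")),
]
print([d[0] for d in top_unique(docs, 2)])  # [2, 1]
```

The linear scan for the worst group keeps the sketch short; an ordered structure would avoid it when n is large.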


Mon, 12 Oct 2020 at 17:38, Diego Ceccarelli (BLOOMBERG/ LONDON) <
[email protected]>:

> > https://issues.apache.org/jira/browse/SOLR-11831
>
> I collaborated on the Las Vegas patch. I don't think that patch will be
> merged, because it modifies too many things in the core; we ended up
> reimplementing it as a standalone plugin.
> Also keep in mind that the patch makes a difference only if you are
> using SolrCloud, while it seems that you are using Lucene.
>
> Do you really need to return 1000 results to the user? Is this for paging
> purposes?
>
> Do you know how frequent the groups are? If they are not too frequent and
> you are not strict about 1000, you could retrieve more, say 2000, without
> grouping and then dedupe afterwards.
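
The over-fetch approach suggested here can be sketched as below, in plain Python rather than against the Lucene API. `dedupe_ranked` and `top_unique_by_overfetch` are hypothetical helpers, and `search` stands in for any ungrouped top-k query that returns `(doc_id, group_key)` pairs already in sort order. The retry loop that doubles k when too few unique hits come back is an added assumption, not part of the suggestion itself.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

Hit = Tuple[int, Hashable]  # (doc_id, group_key), already in sort order


def dedupe_ranked(hits: Sequence[Hit], limit: int) -> List[Hit]:
    """Keep the first hit of each group, preserving order, up to `limit`."""
    seen = set()
    unique: List[Hit] = []
    for doc_id, group in hits:
        if len(unique) == limit:
            break
        if group in seen:
            continue  # a better-ranked hit from this group was already kept
        seen.add(group)
        unique.append((doc_id, group))
    return unique


def top_unique_by_overfetch(
    search: Callable[[int], Sequence[Hit]], limit: int, initial_factor: int = 2
) -> List[Hit]:
    """Fetch more hits than needed, dedupe, and widen the fetch if required."""
    k = limit * initial_factor
    while True:
        hits = search(k)
        unique = dedupe_ranked(hits, limit)
        # Stop when we have enough, or when the index has no more hits to give.
        if len(unique) == limit or len(hits) < k:
            return unique
        k *= 2


# Toy "index": the list order plays the role of the sort order.
index = [(0, "a"), (1, "a"), (2, "b"), (3, "a"), (4, "c"), (5, "d")]
print(top_unique_by_overfetch(lambda k: index[:k], limit=3, initial_factor=1))
```

Because the first hit of each group in a sorted list is that group's best hit, the deduplicated prefix matches what grouping would return whenever the fetch is deep enough.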
>
> Cheers,
> Diego
>
>
> From: [email protected] At: 10/12/20 13:02:46 To: [email protected]
> Subject: Re: Deduplication of search result with custom with custom sort
>
> Thank you very much for helping!
>
> There isn't much I can add about my use case. I have user-generated video
> titles and hash codes that let me tell when two entries are the same
> video. Users search videos by title, and I should return the top 1000
> unique videos to them.
>
> I will try to use grouping without counting groups. Otherwise I'll look
> here https://issues.apache.org/jira/browse/SOLR-11831 or here
> https://lucene.apache.org/solr/guide/8_6/collapse-and-expand-results.html
>
> Thanks again!
>
> Fri, 9 Oct 2020 at 18:57, Jigar Shah <[email protected]>:
>
> > My learnings from dealing with this problem
> >
> > We faced a similar problem before, and did the following things:
> >
> > 1) Don't request totalGroupCount if you can live without it. Computing the
> > group count is an expensive task, and the response was fast once we
> > dropped it. You can still approximate pagination: page up to the total
> > count, and since the group count will be lower, stop paginating when you
> > get empty results.
> > 2) Have more shards, so you can get the best out of parallel execution.
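
Point 1 above (paging without a total group count and stopping at the first empty page) can be sketched like this. It is a plain-Python illustration only: `iter_group_pages` is a hypothetical helper, and `fetch_page(offset, size)` stands in for whatever grouped search call accepts a group offset and a group limit.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_group_pages(
    fetch_page: Callable[[int, int], Sequence[T]], page_size: int
) -> Iterator[List[T]]:
    """Yield pages of groups until the backend returns an empty page."""
    offset = 0
    while True:
        page = list(fetch_page(offset, page_size))
        if not page:
            return  # no total count needed: an empty page marks the end
        yield page
        offset += len(page)


# Toy backend: seven groups, served three at a time.
groups = ["g0", "g1", "g2", "g3", "g4", "g5", "g6"]
for page in iter_group_pages(lambda off, size: groups[off:off + size], 3):
    print(page)
```

The cost of skipping the count is one extra request at the end, which returns nothing.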
> >
> > I have seen use cases with 60M total documents deduplicated on a doc
> > values field, with 4 shards.
> >
> > Query time SLA is around 5-6 seconds. Not unbearable for users.
> >
> > Let me know if you find a better solution.
> >
> > On Fri, Oct 9, 2020 at 11:45 AM Diego Ceccarelli (BLOOMBERG/ LONDON) <
> > [email protected]> wrote:
> >
> > > As Erick said, can you tell us a bit more about the use case?
> > > There might be another way to achieve the same result.
> > >
> > > What are these documents?
> > > Why do you need 1000 docs per user?
> > >
> > >
> > > From: [email protected] At: 10/09/20 14:25:02 To: [email protected]
> > > Subject: Re: Deduplication of search result with custom with custom sort
> > >
> > > 6_500_000 is the total count of groups in the entire collection. I only
> > > return the top 1000 to users.
> > > I use Lucene, and my documents can share the same docvalue; I want to
> > > deduplicate these documents by that docvalue during search.
> > > Also, I sort my documents by multiple fields, so I can't use
> > > DiversifiedTopDocsCollector, which works with the relevance score only.
> > >
> > > Fri, 9 Oct 2020 at 16:02, Erick Erickson <[email protected]>:
> > >
> > > > This is going to be fairly painful. You need to keep a list 6.5M
> > > > items long, sorted.
> > > >
> > > > Before diving in there, I’d really back up and ask what the use case
> > > > is. Returning 6.5M docs to a user is useless, so are you doing
> > > > some kind of analytics, maybe? In which case, and again
> > > > assuming you’re using Solr, Streaming Aggregation might
> > > > be a better option.
> > > >
> > > > This really sounds like an XY problem. You’re trying to solve
> > > > problem X and asking how to accomplish it with Y. What I’m
> > > > questioning is whether Y (grouping) is a good approach or not.
> > > > Perhaps if you explained X there’d be a better suggestion.
> > > >
> > > > Best,
> > > > Erick
> > > >
> > > > > On Oct 9, 2020, at 8:19 AM, Dmitry Emets <[email protected]> wrote:
> > > > >
> > > > > I have 12_000_000 documents, 6_500_000 groups
> > > > >
> > > > > With sort: It takes around 1 sec without grouping, 2 sec with
> > > > > grouping and 12 sec with setAllGroups(true)
> > > > > Without sort: It takes around 0.2 sec without grouping, 0.6 sec
> > > > > with grouping and 10 sec with setAllGroups(true)
> > > > >
> > > > > Thank you, Erick, I will look into it
> > > > >
> > > > > Fri, 9 Oct 2020 at 14:32, Erick Erickson <[email protected]>:
> > > > >
> > > > >> At the Solr level, see CollapsingQParserPlugin:
> > > > >>
> > > > >> https://lucene.apache.org/solr/guide/8_6/collapse-and-expand-results.html
> > > > >>
> > > > >> You could perhaps steal some ideas from that if you
> > > > >> need this at the Lucene level.
> > > > >>
> > > > >> Best,
> > > > >> Erick
> > > > >>
> > > > >>> On Oct 9, 2020, at 7:25 AM, Diego Ceccarelli (BLOOMBERG/ LONDON) <[email protected]> wrote:
> > > > >>>
> > > > >>> Is the field that you are using to dedupe stored as a docvalue?
> > > > >>>
> > > > >>> From: [email protected] At: 10/09/20 12:18:04 To: [email protected]
> > > > >>> Subject: Deduplication of search result with custom with custom sort
> > > > >>>
> > > > >>> Hi,
> > > > >>> I need to deduplicate search results by a specific field, and I
> > > > >>> have no idea how to implement this properly.
> > > > >>> I have tried grouping with setGroupDocsLimit(1); it gives me the
> > > > >>> expected results, but performance is not very good.
> > > > >>> I think that I need something like DiversifiedTopDocsCollector, but
> > > > >>> suitable for collecting TopFieldDocs.
> > > > >>> Is there any way to achieve deduplication with existing Lucene
> > > > >>> components, or do I need to implement my own
> > > > >>> DiversifiedTopFieldsCollector?
> > > > >>>
> > > > >>
> > > > >> ---------------------------------------------------------------------
> > > > >> To unsubscribe, e-mail: [email protected]
> > > > >> For additional commands, e-mail: [email protected]
