RE: RE: Question about grouping in distribute mode

Ian Caldwell Thu, 06 Apr 2017 19:34:31 -0700

Yes, it looks like each shard returns the total count of groups in that 
shard, so the merged number comes out too high if the same group is in more 
than one shard.
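
A minimal sketch of the overcount, assuming the merged figure is simply the 
sum of the per-shard group counts (the shard contents and the function names 
are made up for illustration, this is not Solr code):

```python
# Hypothetical data: the set of groups present on each shard.
shard_groups = {
    "shard1": {"groupA", "groupB", "groupC", "groupD"},
    "shard2": {"groupA", "groupC", "groupD"},
}


def summed_ngroups(shards):
    """Add up each shard's local distinct-group count."""
    return sum(len(groups) for groups in shards.values())


def true_ngroups(shards):
    """Count distinct groups across all shards."""
    return len(set().union(*shards.values()))


print(summed_ngroups(shard_groups))  # 7
print(true_ngroups(shard_groups))    # 4
```

The two numbers only agree when every group lives on exactly one shard.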


The other problem is that, when merging the group data, the counts and total 
count are held in Integers that can overflow, resulting in negative numbers 
(Solr 5.5.3).
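
As a rough illustration of the wraparound (a sketch only: Python integers do 
not overflow, so to_int32 below is a hypothetical helper that mimics signed 
32-bit arithmetic, and the per-shard counts are invented):

```python
def to_int32(value):
    """Wrap a Python int to a signed 32-bit value, like Java int arithmetic."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def merge_counts_int32(counts):
    """Sum the counts with every intermediate result wrapped to 32 bits."""
    total = 0
    for count in counts:
        total = to_int32(total + count)
    return total


def merge_counts_wide(counts):
    """Sum the counts without any width limit (what a 64-bit total would give here)."""
    return sum(counts)


shard_counts = [1_500_000_000, 1_500_000_000]  # hypothetical per-shard counts
print(merge_counts_int32(shard_counts))  # -1294967296
print(merge_counts_wide(shard_counts))   # 3000000000
```

Each shard's count fits in 32 bits on its own; only the merged total wraps.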

Ian
NLA
From: 380382...@qq.com [mailto:380382...@qq.com]
Sent: Friday, 7 April 2017 11:50 AM
To: dev <dev@lucene.apache.org>
Subject: Re: RE: Question about grouping in distribute mode

Thank you.
I think it simply adds shard1.groupNumber to shard2.groupNumber. But groupA may 
be in both shard1 and shard2, so is group.ngroups always bigger than the real 
number?
________________________________
380382...@qq.com

From: Ian Caldwell<mailto:icaldw...@nla.gov.au>
Date: 2017-04-07 09:32
To: 'dev@lucene.apache.org'<mailto:dev@lucene.apache.org>
Subject: RE: Re: Question about grouping in distribute mode
I think this happens because the first pass gets the top nGroups and records 
the shards that they came from;
then the second pass searches only the shards that contributed to 
the list instead of searching all shards.

So when searching for the top 10 groups, a shard may hold data for one of those 
groups but rank it 11th locally (outside its own top 10); that shard is then 
left off the list for the second pass.

Searching (for 3 groups) could return
From GROUPING_DISTRIBUTED_FIRST
shard1: groupA, groupB & groupC   (groupD ranked 4th, so not returned in the list)
shard2: groupA, groupC & groupD

After merging, the top groups would be groupA, groupC & groupD

From GROUPING_DISTRIBUTED_SECOND
shard1:
groupA: doc1, doc3 & doc5
groupC: doc11, doc13 & doc15
groupD: doc111, doc113 & doc115
shard2:
groupA: doc2, doc4 & doc6
groupC: doc12, doc14 & doc16
groupD: doc112, doc114 & doc116

So you need to do the second pass against all shards for the top docs so that 
you don’t miss the docs from groupD in shard1.
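
The example can be played through with a small simulation. This is a toy 
model, not Solr's code: the scores, the groupB document and all function 
names are invented, and it assumes that the restricted second pass asks, for 
each group, only the shards that returned that group in the first pass.

```python
# Hypothetical per-shard data: group -> [(doc id, score)]; scores are made up.
SHARDS = {
    "shard1": {
        "groupA": [("doc1", 9.0), ("doc3", 8.9), ("doc5", 8.8)],
        "groupB": [("doc21", 5.0)],
        "groupC": [("doc11", 8.0), ("doc13", 7.9), ("doc15", 7.8)],
        "groupD": [("doc111", 4.0), ("doc113", 3.9), ("doc115", 3.8)],
    },
    "shard2": {
        "groupA": [("doc2", 9.5), ("doc4", 9.4), ("doc6", 9.3)],
        "groupC": [("doc12", 8.5), ("doc14", 8.4), ("doc16", 8.3)],
        "groupD": [("doc112", 7.0), ("doc114", 6.9), ("doc116", 6.8)],
    },
}


def first_pass(shard, k):
    """Return one shard's top-k groups as {group: best score}."""
    best = {g: max(score for _, score in docs) for g, docs in shard.items()}
    top = sorted(best, key=best.get, reverse=True)[:k]
    return {g: best[g] for g in top}


def merge_top_groups(per_shard, k):
    """Merge the shard responses into the global top-k groups and
    remember which shards reported each group."""
    best, sources = {}, {}
    for name, groups in per_shard.items():
        for g, score in groups.items():
            best[g] = max(score, best.get(g, float("-inf")))
            sources.setdefault(g, set()).add(name)
    top = sorted(best, key=best.get, reverse=True)[:k]
    return top, sources


def second_pass(top_groups, shards_for_group, n):
    """Collect the top-n docs per group from the shards listed for that group."""
    result = {}
    for g in top_groups:
        docs = [d for name in shards_for_group[g] for d in SHARDS[name].get(g, [])]
        docs.sort(key=lambda d: d[1], reverse=True)
        result[g] = [doc_id for doc_id, _ in docs[:n]]
    return result


per_shard = {name: first_pass(shard, 3) for name, shard in SHARDS.items()}
top, sources = merge_top_groups(per_shard, 3)
all_shards = {g: set(SHARDS) for g in top}

restricted = second_pass(top, sources, 6)   # only shards that reported the group
complete = second_pass(top, all_shards, 6)  # every shard

print(top)                   # ['groupA', 'groupC', 'groupD']
print(restricted["groupD"])  # ['doc112', 'doc114', 'doc116']
print(complete["groupD"])    # also contains doc111, doc113 and doc115
```

With the restricted second pass, groupD loses its shard1 documents because 
shard1 ranked groupD 4th and never reported it in the first pass.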



Ian
NLA


-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, 7 April 2017 1:16 AM
To: dev@lucene.apache.org<mailto:dev@lucene.apache.org>
Subject: Re: Re: Question about grouping in distribute mode

from the reference guide:

group.ngroups and group.facet require that all documents in each group must be 
co-located on the same shard in order for accurate counts to be returned.

Can't give you a technical reason, but there's no expectation it is supported 
with composite ID routing.

Best,
Erick

On Thu, Apr 6, 2017 at 2:52 AM, 380382...@qq.com wrote:
> Thanks for your help.
> When I use compositeId routing, I find that group.ngroups is a wrong number.
> I would like to know what in the implementation leads to this.
> Why must we use implicit routing when we want grouping to work
> correctly?
>
> ________________________________
> 380382...@qq.com
>
>
> From: Diego Ceccarelli (BLOOMBERG/ LONDON)
> Date: 2017-04-06 17:16
> To: 380382856
> Subject: Re: Re: Question about grouping in distribute mode
>
> Dear 380382856, I would be happy to help you if you can provide more
> information. Do you want to know why grouping needs a specific
> routing strategy? My point is that grouping usually involves 3
> communications between the federator and the shards, but in the case of
> ngroup=1 it would be possible to obtain the same result with 2
> communications.
>
> Could you please post your question on the Solr user mailing list
> [1]? That way my answer will be useful to all Solr users, and people
> more expert than me can also answer (or correct me if I say something
> wrong :))
>
> Have a good day!
> Diego
>
> [1] http://lucene.apache.org/solr/community.html#mailing-lists-irc
>
>
> From: 380382...@qq.com<mailto:380382...@qq.com> At: 04/06/17 08:38:20
> To: DIEGO CECCARELLI (BLOOMBERG/ LONDON)
> Subject: Re: Re: Question about grouping in distribute mode
>
> Hello, can you help me?
> There is a problem that has been bothering me: why does using
> group.ngroups in SolrCloud require the implicit routing strategy?
> 380382...@qq.com
>
>
> From: Diego Ceccarelli (BLOOMBERG/ LONDON)
> Date: 2017-03-30 22:09
> To: dev
> Subject: Re: Question about grouping in distribute mode
>
> Yes, I agree. And if there are no problems with the logic, it would
> improve performance in both cases.
>
> From: dev@lucene.apache.org At: 03/30/17 14:59:31
> To: dev@lucene.apache.org
> Subject: Re: Question about grouping in distribute mode
>
> This is also the case for non-distributed, isn’t it?  The lucene-level
> FirstPassGroupingCollector doesn’t actually record the docid of the
> top doc for each group at the moment, but I don’t think there’s any
> reason it couldn’t - it’s stored in the relevant FieldComparator.  And
> it would be a nice shortcut in GroupingSearch more generally.
>
> Alan Woodward
> www.flax.co.uk<http://www.flax.co.uk>
>
>
> On 30 Mar 2017, at 14:26, Diego Ceccarelli <diego.ceccare...@gmail.com>
> wrote:
>
> Hello, I'm currently working on Solr grouping in order to support
> reranking [1].
> I have a working patch for non-distributed search, and I'm now working on
> the distributed setting.
>
> Looking at the code, a distributed grouping search (top-k groups, top-n
> documents for each group) consists of:
>
> GROUPING_DISTRIBUTED_FIRST
> 1. given the grouping query, each shard will return the top-k groups
> 2. federator will merge the top-k groups and will produce the top-k
> groups for the query
>
> GROUPING_DISTRIBUTED_SECOND
> 1. given the top-k groups  each shard will return its top-n documents
> for each group.
> 2. federator will then compute top-n documents for each group merging
> all the shards responses.
>
> GET_FIELDS
> as usual
>
> My plan was to change the collector in GROUPING_DISTRIBUTED_SECOND,
> and return the top documents for each group with a new score given by
> the function used to rerank (affecting maxScore for each group and
> then also the order of the groups).
> Looking at the code then I realized that TopGroups asserts that order
> of the groups is not changing, and I realized that indeed _ if the
> ranking function is the same, group order can't change after the first
> stage _.
>
> My question is: if the user is interested only in the top document for
> each group (i.e., the default: group.limit = 1), do we really need
> GROUPING_DISTRIBUTED_SECOND, or could we skip it?
> Is there any reason to perform GROUPING_DISTRIBUTED_SECOND in this
> case? Or could we just return the top docid together with the
> topgroups in GROUPING_DISTRIBUTED_FIRST and then go directly to GET_FIELDS?
>
> Cheers,
> Diego
>
> [1] https://issues.apache.org/jira/browse/SOLR-8542
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: 
dev-unsubscr...@lucene.apache.org<mailto:dev-unsubscr...@lucene.apache.org> For 
additional commands, e-mail: 
dev-h...@lucene.apache.org<mailto:dev-h...@lucene.apache.org>

