[ 
https://issues.apache.org/jira/browse/SOLR-8988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243103#comment-15243103
 ] 

Keith Laban commented on SOLR-8988:
-----------------------------------

To clarify the setup for this test:

bq. num shards
- 12

bq. num docs per shard
- ~70 million

bq. num terms in field
- ~15 million

bq. num terms with non-zero facet counts for docs matching query on a per shard 
basis
- ~90k

bq. how much variance there is in the num terms with non-zero facet counts for 
docs matching query on a per shard basis
- evenly distributed 



bq. ...is that if you get back a count of foo=0 from shardA, and if foo winds 
up being a candidate term for the final topN list because of it's count on 
other shards, then you know definitively that you don't have to ask shardA to 
provide a refinement value for "foo" - you already know it's count.

This is the part that I would argue doesn't matter. Suppose you asked for 10 
terms from shardA with mincount=1 and received only 5 terms back. Then you 
know that if foo was returned by shardB but not by shardA, the maximum count it 
could have had in shardA was 0; otherwise it would have been returned in the 
initial request. 

On the other hand, if you ask for 10 terms with mincount=1 and you get back 10 
terms, each with a count >= 1, the response would have been identical with 
mincount=0. 
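
The two cases above can be checked with a small stand-alone sketch. This is not Solr code: {{ShardFacetSketch}}, {{topTerms}} and {{demoCounts}} are hypothetical names, and the sort order (count descending, then term ascending) is an assumption made only to keep the example deterministic.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one shard's facet response; not Solr code.
class ShardFacetSketch {

    // Sample per-shard counts: three matching terms, and "foo" with count 0.
    static Map<String, Long> demoCounts() {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("a", 5L);
        counts.put("b", 3L);
        counts.put("c", 1L);
        counts.put("foo", 0L);
        return counts;
    }

    // Returns at most `limit` terms whose count is >= mincount,
    // sorted by count descending, then by term ascending.
    static List<Map.Entry<String, Long>> topTerms(Map<String, Long> counts, int limit, long mincount) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            if (e.getValue() >= mincount) {
                out.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
            }
        }
        out.sort((x, y) -> {
            int byCount = Long.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        return out.size() > limit ? new ArrayList<>(out.subList(0, limit)) : out;
    }
}
```

With a limit of 3 the shard fills the limit with counts >= 1, so mincount=1 and mincount=0 give the same list. With a limit of 10 and mincount=1 only 3 terms come back, so the absence of foo already tells the caller its count on this shard is 0.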

The logic that aids refinement, pulled from {{FacetComponent.DistributedFieldFacet}}:
{code}
    void add(int shardNum, NamedList shardCounts, int numRequested) {
      // shardCounts could be null if there was an exception
      int sz = shardCounts == null ? 0 : shardCounts.size();
      int numReceived = sz;
      
      FixedBitSet terms = new FixedBitSet(termNum + sz);

      long last = 0;
      for (int i = 0; i < sz; i++) {
        String name = shardCounts.getName(i);
        long count = ((Number) shardCounts.getVal(i)).longValue();
        if (name == null) {
          missingCount += count;
          numReceived--;
        } else {
          ShardFacetCount sfc = counts.get(name);
          if (sfc == null) {
            sfc = new ShardFacetCount();
            sfc.name = name;
            sfc.indexed = ftype == null ? sfc.name : ftype.toInternal(sfc.name);
            sfc.termNum = termNum++;
            counts.put(name, sfc);
          }
          sfc.count += count;
          terms.set(sfc.termNum);
          last = count;
        }
      }
      
      // the largest possible missing term is initialMincount if we received
      // less than the number requested.
      if (numRequested < 0 || numRequested != 0 && numReceived < numRequested) {
        last = initialMincount;
      }
      
      missingMaxPossible += last;
      missingMax[shardNum] = last;
      counted[shardNum] = terms;
    }
{code}

However, I think this block should also be changed to:
{code}
      if (numRequested < 0 || numRequested != 0 && numReceived < numRequested) {
        last = Math.max(initialMincount-1, 0);
      }
{code}
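
To make the effect of that change concrete, here is a minimal sketch that isolates the bound on a missing term's count. {{MissingBoundSketch}}, {{current}} and {{proposed}} are hypothetical names used only for illustration; the condition is copied from the snippet above, and the rest is an assumption about how one might compare the two variants side by side.

```java
// Hypothetical helper that isolates the "largest possible missing count"
// computation from the end of add(); not part of Solr's API.
class MissingBoundSketch {

    // True when the shard returned fewer terms than requested (or the
    // request was unlimited), i.e. the shard had nothing more to return.
    static boolean shardExhausted(int numRequested, int numReceived) {
        return numRequested < 0 || numRequested != 0 && numReceived < numRequested;
    }

    // Bound as computed in the existing code: falls back to initialMincount.
    static long current(long last, int numRequested, int numReceived, long initialMincount) {
        return shardExhausted(numRequested, numReceived) ? initialMincount : last;
    }

    // Bound with the suggested change: a term the shard did not return must
    // have a count strictly below initialMincount, but never below 0.
    static long proposed(long last, int numRequested, int numReceived, long initialMincount) {
        return shardExhausted(numRequested, numReceived) ? Math.max(initialMincount - 1, 0) : last;
    }
}
```

For a shard that was asked for 10 terms with mincount=1 and returned 5, {{current}} yields 1 while {{proposed}} yields 0. When the shard fills the limit, both return the count of the last term received.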

> Improve facet.method=fcs performance in SolrCloud
> -------------------------------------------------
>
>                 Key: SOLR-8988
>                 URL: https://issues.apache.org/jira/browse/SOLR-8988
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Keith Laban
>         Attachments: SOLR-8988.patch
>
>
> This relates to SOLR-8559 -- which improves the algorithm used by fcs 
> faceting when {{facet.mincount=1}}
> This patch allows {{facet.mincount}} to be sent as 1 for distributed queries. 
> As far as I can tell there is no reason to set {{facet.mincount=0}} for 
> refinement purposes. After trying to make sense of all the refinement logic, 
> I can't see how the difference between _no value_ and _value=0_ would have a 
> negative effect.
> *Test perf:*
> - ~15 million unique terms
> - query matches ~3 million documents
> *Params:*
> {code}
> facet.mincount=1
> facet.limit=500
> facet.method=fcs
> facet.sort=count
> {code}
> *Average Time Per Request:*
> - Before patch: ~20 seconds
> - After patch: <1 second
> *Note*: all tests pass and in my test, the output was identical before and 
> after patch.


