[
https://issues.apache.org/jira/browse/SOLR-8988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243103#comment-15243103
]
Keith Laban commented on SOLR-8988:
-----------------------------------
To clarify the setup of this test:
bq. num shards
- 12
bq. num docs per shard
- ~70 million
bq. num terms in field
- ~15 million
bq. num terms with non-zero facet counts for docs matching query on a per shard
basis
- ~90k
bq. how much variance there is in the num terms with non-zero facet counts for
docs matching query on a per shard basis
- evenly distributed
bq. ...is that if you get back a count of foo=0 from shardA, and if foo winds
up being a candidate term for the final topN list because of it's count on
other shards, then you know definitively that you don't have to ask shardA to
provide a refinement value for "foo" - you already know it's count.
This is the part that I would argue doesn't matter. Suppose you asked for 10
terms from shardA with mincount=1 and received only 5 terms back. Then you
know that if foo was in shardB but not in shardA, the maximum count it could
have had in shardA was 0; otherwise it would have been returned in the initial
request.
On the other hand, if you ask for 10 terms with mincount=1 and get back 10
terms with a count >= 1, the response would have been identical with
mincount=0.
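To make the two cases above concrete, here is a minimal sketch that is not Solr code. The class {{MincountSketch}} and the helper {{topTerms}} are hypothetical names; the helper only mimics a shard returning its top terms by count, subject to a limit and a mincount, and the sample counts are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MincountSketch {
    // Hypothetical stand-in for a shard's facet response: the top `limit`
    // terms by descending count (ties broken by term name), keeping only
    // terms whose count is >= mincount.
    static Map<String, Long> topTerms(Map<String, Long> shardCounts, int limit, long mincount) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(shardCounts.entrySet());
        entries.sort((a, b) -> {
            int c = Long.compare(b.getValue(), a.getValue());
            return c != 0 ? c : a.getKey().compareTo(b.getKey());
        });
        Map<String, Long> out = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : entries) {
            if (out.size() >= limit) {
                break;
            }
            if (e.getValue() >= mincount) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up per-shard counts; "foo" and "bar" match no documents here.
        Map<String, Long> shardA = new LinkedHashMap<>();
        shardA.put("a", 5L);
        shardA.put("b", 3L);
        shardA.put("c", 1L);
        shardA.put("foo", 0L);
        shardA.put("bar", 0L);

        // Case 1: fewer terms than requested come back, so any term that
        // was not returned (such as "foo") must have count 0 on this shard.
        System.out.println(topTerms(shardA, 10, 1)); // {a=5, b=3, c=1}

        // Case 2: the limit is filled with counts >= 1, so mincount=0 and
        // mincount=1 produce the same response.
        System.out.println(topTerms(shardA, 3, 1).equals(topTerms(shardA, 3, 0))); // true
    }
}
```

In case 1 the coordinator can infer the zero without being told; in case 2 the zero-count terms never make it into the response either way.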
The logic that aids refinement is pulled from {{FacetComponent.DistributedFieldFacet}}:
{code}
void add(int shardNum, NamedList shardCounts, int numRequested) {
  // shardCounts could be null if there was an exception
  int sz = shardCounts == null ? 0 : shardCounts.size();
  int numReceived = sz;
  FixedBitSet terms = new FixedBitSet(termNum + sz);
  long last = 0;
  for (int i = 0; i < sz; i++) {
    String name = shardCounts.getName(i);
    long count = ((Number) shardCounts.getVal(i)).longValue();
    if (name == null) {
      missingCount += count;
      numReceived--;
    } else {
      ShardFacetCount sfc = counts.get(name);
      if (sfc == null) {
        sfc = new ShardFacetCount();
        sfc.name = name;
        sfc.indexed = ftype == null ? sfc.name : ftype.toInternal(sfc.name);
        sfc.termNum = termNum++;
        counts.put(name, sfc);
      }
      sfc.count += count;
      terms.set(sfc.termNum);
      last = count;
    }
  }
  // the largest possible missing term is initialMincount if we received
  // less than the number requested.
  if (numRequested < 0 || numRequested != 0 && numReceived < numRequested) {
    last = initialMincount;
  }
  missingMaxPossible += last;
  missingMax[shardNum] = last;
  counted[shardNum] = terms;
}
{code}
However, I think the final {{if}} block should also be changed to:
{code}
if (numRequested < 0 || numRequested != 0 && numReceived < numRequested) {
  last = Math.max(initialMincount-1, 0);
}
{code}
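As a rough illustration of why the tighter bound seems safe: when a shard returns fewer terms than requested, every omitted term fell below {{initialMincount}}, so its count on that shard can be at most {{initialMincount - 1}} and never below 0. The sketch below isolates just that condition. {{MissingBoundSketch}}, {{missingBound}} and the {{proposed}} flag are hypothetical names for illustration, not part of {{FacetComponent}}.

```java
class MissingBoundSketch {
    // Mirrors the condition at the end of DistributedFieldFacet.add():
    // returns an upper bound on the count of any term this shard did not
    // return. `proposed` switches between the existing bound and the
    // suggested one.
    static long missingBound(int numRequested, int numReceived, long lastCount,
                             long initialMincount, boolean proposed) {
        if (numRequested < 0 || numRequested != 0 && numReceived < numRequested) {
            // The shard ran out of qualifying terms, so anything missing
            // is strictly below initialMincount.
            return proposed ? Math.max(initialMincount - 1, 0) : initialMincount;
        }
        // The shard filled the request: a missing term can be as large as
        // the last (smallest) count that was returned.
        return lastCount;
    }

    public static void main(String[] args) {
        // Shard returned 5 of 10 requested terms with mincount=1.
        System.out.println(missingBound(10, 5, 2, 1, false)); // 1 (existing bound)
        System.out.println(missingBound(10, 5, 2, 1, true));  // 0 (proposed bound)
        // Shard filled the request: both variants fall back to the last count.
        System.out.println(missingBound(10, 10, 2, 1, true)); // 2
    }
}
```

With mincount=1 the proposed bound drops to 0 for a shard that came up short, which matches the earlier argument that a term missing from such a response must have a count of 0 there.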
> Improve facet.method=fcs performance in SolrCloud
> -------------------------------------------------
>
> Key: SOLR-8988
> URL: https://issues.apache.org/jira/browse/SOLR-8988
> Project: Solr
> Issue Type: Improvement
> Reporter: Keith Laban
> Attachments: SOLR-8988.patch
>
>
> This relates to SOLR-8559, which improves the algorithm used by fcs
> faceting when {{facet.mincount=1}}.
> This patch allows {{facet.mincount}} to be sent as 1 for distributed queries.
> As far as I can tell there is no reason to set {{facet.mincount=0}} for
> refinement purposes. After trying to make sense of all the refinement logic,
> I can't see how the difference between _no value_ and _value=0_ would have a
> negative effect.
> *Test perf:*
> - ~15 million unique terms
> - query matches ~3 million documents
> *Params:*
> {code}
> facet.mincount=1
> facet.limit=500
> facet.method=fcs
> facet.sort=count
> {code}
> *Average Time Per Request:*
> - Before patch: ~20 seconds
> - After patch: <1 second
> *Note*: all tests pass and in my test, the output was identical before and
> after patch.