[
https://issues.apache.org/jira/browse/SOLR-15760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Gibney reassigned SOLR-15760:
-------------------------------------
Assignee: Michael Gibney
> Improve default distributed facet overrequest function/heuristic
> ----------------------------------------------------------------
>
> Key: SOLR-15760
> URL: https://issues.apache.org/jira/browse/SOLR-15760
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Components: Facet Module
> Affects Versions: main (9.0)
> Reporter: Michael Gibney
> Assignee: Michael Gibney
> Priority: Minor
> Time Spent: 10m
> Remaining Estimate: 0h
>
> In {{FacetFieldProcessor}} for distributed requests, additive
> {{facet.overrequest}} can be specified as an integer. In the absence of an
> explicitly specified value, a default overrequest is calculated according to
> the following function/heuristic:
> {code:java}
> if (fcontext.isShard()) {
>   if (freq.overrequest == -1) {
>     // add over-request if this is a shard request and if we have a small
>     // offset (large offsets will already be gathering many more buckets
>     // than needed)
>     if (freq.offset < 10) {
>       // default: add 10% plus 4 (to overrequest for very small limits)
>       effectiveLimit = (long) (effectiveLimit * 1.1 + 4);
>     }
>   }...
> } ...
> {code}
> The logic of overrequesting is: there is some balance of redundant vs. unique
> values across different shards in a distributed context, with uneven
> distribution across shards. This can result in a situation where values with
> uniform distribution across shards may be erroneously excluded from results,
> in favor of values that may occur less frequently overall, but that (due to
> uneven distribution) occur very frequently on some nodes and not on others.
> "Refinement" is designed to ensure that all top-level values are informed by
> data from all shards -- but in the situation that overrequesting is designed
> to mitigate, refinement can't help for values that are excluded even from the
> initial request phase.
> The main issue is casting a wide-enough initial net. As such, the problem is
> more acute (calling for more of an "overrequest boost") when requesting
> relatively small numbers of values.
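The effect described above can be reproduced with a toy model. The following is a minimal sketch, not Solr code: the class name, the shard data, and the two-phase merge are all assumptions made for illustration, and the "refinement" step is simplified to computing exact totals for whichever terms at least one shard nominated.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class OverrequestSketch {
  // Hypothetical per-shard term counts: "common" is spread evenly,
  // every other term lives on exactly one shard.
  static final List<Map<String, Integer>> SHARDS = List.of(
      Map.of("common", 40, "a1", 50, "a2", 45),
      Map.of("common", 40, "b1", 50, "b2", 45),
      Map.of("common", 40, "c1", 50, "c2", 45));

  // Top n terms by count, descending; ties broken by term for determinism.
  static List<String> topTerms(Map<String, Integer> counts, int n) {
    List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
    entries.sort((a, b) -> {
      int byCount = Integer.compare(b.getValue(), a.getValue());
      return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
    });
    List<String> top = new ArrayList<>();
    for (Map.Entry<String, Integer> e : entries.subList(0, Math.min(n, entries.size()))) {
      top.add(e.getKey());
    }
    return top;
  }

  static List<String> distributedTop(
      List<Map<String, Integer>> shards, int limit, int overrequest) {
    // Phase 1: each shard nominates its local top (limit + overrequest) terms.
    Set<String> candidates = new TreeSet<>();
    for (Map<String, Integer> shard : shards) {
      candidates.addAll(topTerms(shard, limit + overrequest));
    }
    // Phase 2 (simplified refinement): exact totals, but only for nominated terms.
    Map<String, Integer> totals = new HashMap<>();
    for (String term : candidates) {
      int total = 0;
      for (Map<String, Integer> shard : shards) {
        total += shard.getOrDefault(term, 0);
      }
      totals.put(term, total);
    }
    return topTerms(totals, limit);
  }

  public static void main(String[] args) {
    System.out.println(distributedTop(SHARDS, 2, 0)); // [a1, b1]
    System.out.println(distributedTop(SHARDS, 2, 1)); // [common, a1]
  }
}
```

With no overrequest, "common" (120 overall) never becomes a candidate, because it ranks third on every shard; exact totals for the nominated terms cannot bring it back. Asking each shard for one extra bucket is enough to recover it in this toy data set.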
> The problem with the existing code is that, iiuc, there is no reason, from the
> point of view of the overrequest function/heuristic, to make a distinction
> between {{offset}} and {{limit}}. In terms of the need for overrequest,
> "offset=10, limit=10" should be equivalent to "limit=20" -- but the existing
> code treats these requests differently: adding overrequest to the latter, but
> not the former.
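To make the inconsistency concrete, here is a minimal standalone sketch of the quoted heuristic. The class and method names are hypothetical, and `shardBuckets` assumes that a shard gathers `offset + effectiveLimit` buckets in the initial phase; that assumption is not taken from the snippet above.

```java
final class DefaultOverrequestSketch {
  // Mirrors the quoted default: boost the limit only when offset < 10.
  static long effectiveLimit(long offset, long limit) {
    long effectiveLimit = limit;
    if (offset < 10) {
      effectiveLimit = (long) (effectiveLimit * 1.1 + 4);
    }
    return effectiveLimit;
  }

  // ASSUMPTION: a shard collects offset + effectiveLimit buckets in phase 1.
  static long shardBuckets(long offset, long limit) {
    return offset + effectiveLimit(offset, limit);
  }

  public static void main(String[] args) {
    System.out.println(shardBuckets(0, 20));  // 26: overrequest applied
    System.out.println(shardBuckets(10, 10)); // 20: same window, no overrequest
    System.out.println(shardBuckets(9, 10));  // 24: a smaller window gathers more
  }
}
```

Under that assumption, both requests need the same top 20 values, yet only one of them gets the boost, and moving the offset from 9 to 10 makes the shard gather fewer buckets than before.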
> I propose:
> At a minimum, for logical consistency, default distributed facet overrequest
> should be converted to be a function/heuristic based on {{offset+limit}}.
> Having thought through this somewhat, I suggest that the overrequest function
> should be a monotonically non-decreasing function (to avoid having an
> arbitrary inflection point where requesting _more_ values disables
> overrequest and results in _fewer_ values being requested in the initial
> phase). This function could simply be a static boost (e.g., {{f\(x)=x+10}}),
> or could be a function that decays to the identity function for higher
> numbers of requested values where overrequest becomes unnecessary (e.g.,
> {{f\(x)=x+(90/(7+x))}}).
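The two candidate functions could be sketched as follows, applied to `offset + limit`. This is an illustration only: the class and method names are hypothetical, and truncating the decaying boost with integer division is an assumption (the formula above is stated over real numbers and does not specify rounding).

```java
final class ProposedOverrequestSketch {
  // f(x) = x + 10: a static boost, independent of how many values are requested.
  static long staticBoost(long requested) {
    return requested + 10;
  }

  // f(x) = x + 90 / (7 + x): the boost decays toward 0 as x grows.
  // ASSUMPTION: the boost is truncated by integer division; requested must be >= 0.
  static long decayingBoost(long requested) {
    return requested + 90 / (7 + requested);
  }

  // Offset and limit are treated identically: only their sum matters.
  static long shardBuckets(long offset, long limit) {
    return decayingBoost(offset + limit);
  }

  public static void main(String[] args) {
    System.out.println(shardBuckets(0, 20));  // 23
    System.out.println(shardBuckets(10, 10)); // 23
    System.out.println(decayingBoost(1000));  // 1000: identity for large requests
  }
}
```

With integer truncation, the decaying variant stays non-decreasing over non-negative inputs (it returns 12 for every x from 0 through 6, then grows), and it reaches the identity function once `7 + x` exceeds 90.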
--
This message was sent by Atlassian Jira
(v8.3.4#803005)