[jira] [Assigned] (SOLR-15760) Improve default distributed facet overrequest function/heuristic

Michael Gibney (Jira) Tue, 02 Nov 2021 09:49:05 -0700


     [ 
https://issues.apache.org/jira/browse/SOLR-15760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Michael Gibney reassigned SOLR-15760:
-------------------------------------

    Assignee: Michael Gibney

> Improve default distributed facet overrequest function/heuristic
> ----------------------------------------------------------------
>
>                 Key: SOLR-15760
>                 URL: https://issues.apache.org/jira/browse/SOLR-15760
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Facet Module
>    Affects Versions: main (9.0)
>            Reporter: Michael Gibney
>            Assignee: Michael Gibney
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In {{FacetFieldProcessor}} for distributed requests, additive 
> {{facet.overrequest}} can be specified as an integer. In the absence of an 
> explicitly specified value, a default overrequest is calculated according to 
> the following function/heuristic:
> {code:java}
>       if (fcontext.isShard()) {
>         if (freq.overrequest == -1) {
>           // add over-request if this is a shard request and if we have a small offset
>           // (large offsets will already be gathering many more buckets than needed)
>           if (freq.offset < 10) {
>             // default: add 10% plus 4 (to overrequest for very small limits)
>             effectiveLimit = (long) (effectiveLimit * 1.1 + 4);
>           }
>         }...
>       } ...
> {code}
> The logic of overrequesting is: there is some balance of redundant vs. unique 
> values across different shards in a distributed context, with uneven 
> distribution across shards. This can result in a situation where values with 
> uniform distribution across shards may be erroneously excluded from results, 
> in favor of values that may occur less frequently overall, but that (due to 
> uneven distribution) occur very frequently on some nodes and not on others.
> "Refinement" is designed to ensure that all top-level values are informed by 
> data from all shards -- but in the situation that overrequesting is designed 
> to mitigate, refinement can't help for values that are excluded even from the 
> initial request phase.
> The main issue is casting a wide-enough initial net. As such, the problem is 
> more acute (calling for more of an "overrequest boost") when requesting 
> relatively small numbers of values.
> The problem with the existing code is that, iiuc, there is no reason, from the 
> point of view of the overrequest function/heuristic, to make a distinction 
> between {{offset}} and {{limit}}. In terms of the need for overrequest, 
> "offset=10, limit=10" should be equivalent to "limit=20" -- but the existing 
> code treats these requests differently: adding overrequest to the latter, but 
> not the former.
> I propose:
> At a minimum, for logical consistency, default distributed facet overrequest 
> should be converted to be a function/heuristic based on {{offset+limit}}.
> Having thought through this somewhat, I suggest that the overrequest function 
> should be a monotonically non-decreasing function (to avoid having an 
> arbitrary inflection point where requesting _more_ values disables 
> overrequest and results in _fewer_ values being requested in the initial 
> phase). This function could simply be a static boost (e.g., {{f(x)=x+10}}), 
> or could be a function that decays to the identity function for higher 
> numbers of requested values where overrequest becomes unnecessary (e.g., 
> {{f(x)=x+(90/(7+x))}}).
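
The comparison described in the issue can be tried out with a small standalone sketch. This is not Solr code: the class and method names are hypothetical, the model assumes a shard's initial phase gathers offset plus the effective limit, and fractional results are truncated with a (long) cast the way the quoted snippet does. The two candidate functions are the examples from the description, applied to x = offset + limit.

```java
import java.util.function.LongUnaryOperator;

// Hypothetical, standalone model of the heuristics discussed above; not Solr code.
final class OverrequestSketch {

  // Mirrors the quoted snippet: boost the limit by 10% plus 4 only when offset < 10.
  // Assumption: the initial shard phase gathers offset + effectiveLimit buckets.
  static long currentInitialRequest(long offset, long limit) {
    long effectiveLimit = limit;
    if (offset < 10) {
      effectiveLimit = (long) (effectiveLimit * 1.1 + 4);
    }
    return offset + effectiveLimit;
  }

  // Candidate 1 from the description: static boost, f(x) = x + 10.
  static long staticBoost(long requested) {
    return requested + 10;
  }

  // Candidate 2 from the description: f(x) = x + 90 / (7 + x), decaying toward identity.
  // Assumption: truncate the fractional part, as the existing (long) cast does.
  static long decayingBoost(long requested) {
    return (long) (requested + 90.0 / (7 + requested));
  }

  // Returns true if f never decreases as x goes from 0 to maxRequested inclusive.
  static boolean isNonDecreasing(LongUnaryOperator f, long maxRequested) {
    long previous = f.applyAsLong(0);
    for (long x = 1; x <= maxRequested; x++) {
      long current = f.applyAsLong(x);
      if (current < previous) {
        return false;
      }
      previous = current;
    }
    return true;
  }

  public static void main(String[] args) {
    // Same total number of values wanted (20), requested in two different ways.
    System.out.println("current, offset=0  limit=20: " + currentInitialRequest(0, 20));
    System.out.println("current, offset=10 limit=10: " + currentInitialRequest(10, 10));
    // Asking for one more value across the offset threshold.
    System.out.println("current, offset=9  limit=10: " + currentInitialRequest(9, 10));

    // Candidates keyed on offset + limit.
    long requested = 10 + 10;
    System.out.println("static boost,   x=20: " + staticBoost(requested));
    System.out.println("decaying boost, x=20: " + decayingBoost(requested));
    System.out.println("static non-decreasing:   "
        + isNonDecreasing(OverrequestSketch::staticBoost, 1000));
    System.out.println("decaying non-decreasing: "
        + isNonDecreasing(OverrequestSketch::decayingBoost, 1000));
  }
}
```

Under these assumptions the current heuristic yields 26 for "limit=20" but 20 for "offset=10, limit=10", and moving from offset=9 to offset=10 with limit=10 drops the initial request from 24 to 20. Both candidates treat the two spellings of "20 values" identically because they only see offset+limit. The rounding choice matters for the decaying example: before truncation its value dips slightly between x=0 and x=2, so a check along the lines of isNonDecreasing is a cheap way to vet whatever constants and rounding end up being chosen.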



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
