[
https://issues.apache.org/jira/browse/SOLR-15760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Gibney updated SOLR-15760:
----------------------------------
Description:
In {{FacetFieldProcessor}} for distributed requests, additive
{{facet.overrequest}} can be specified as an integer. In the absence of an
explicitly specified value, a default overrequest is calculated according to
the following function/heuristic:
{code:java}
if (fcontext.isShard()) {
  if (freq.overrequest == -1) {
    // add over-request if this is a shard request and if we have a small offset
    // (large offsets will already be gathering many more buckets than needed)
    if (freq.offset < 10) {
      effectiveLimit = (long) (effectiveLimit * 1.1 + 4); // default: add 10% plus 4 (to overrequest for very small limits)
    }
  } ...
} ...
{code}
The logic of overrequesting is: in a distributed context there is some balance
of redundant vs. unique values across different shards, with uneven
distribution across shards. This can result in a situation where values with
uniform distribution across shards may be erroneously excluded from results, in
favor of values that may occur less frequently overall, but that (due to uneven
distribution) occur very frequently on some nodes and not on others.
"Refinement" is designed to ensure that all top-level values are informed by
data from all shards -- but in the situation that overrequesting is designed to
mitigate, refinement can't help for values that are excluded even from the
initial request phase.
The main issue is casting a wide-enough initial net. As such, the problem is
more acute (calling for more of an "overrequest boost") when requesting
relatively small numbers of values.
The problem with the existing code is that, as far as I can tell, there is no
reason, from the point of view of the overrequest function/heuristic, to make a
distinction between {{offset}} and {{limit}}. In terms of the need for
overrequest, "offset=10, limit=10" should be equivalent to "limit=20" -- but
the existing code treats these requests differently: adding overrequest to the
latter, but not the former.
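To illustrate the inconsistency numerically, here is a minimal standalone sketch (the helper name is mine, and it assumes, as the discussion above implies, that the shard-level effective limit starts at {{offset + limit}} before the heuristic is applied -- this is not Solr's actual API):

```java
public class OverrequestDemo {
    // Sketch of the current default heuristic for a shard request,
    // assuming the effective limit starts at offset + limit.
    static long effectiveLimit(long offset, long limit) {
        long effectiveLimit = offset + limit;
        if (offset < 10) {
            // default: add 10% plus 4 (to overrequest for very small limits)
            effectiveLimit = (long) (effectiveLimit * 1.1 + 4);
        }
        return effectiveLimit;
    }

    public static void main(String[] args) {
        // Both requests need the top 20 values overall, but only the
        // first one gets any overrequest:
        System.out.println(effectiveLimit(0, 20));  // limit=20            -> 26
        System.out.println(effectiveLimit(10, 10)); // offset=10, limit=10 -> 20
    }
}
```

Two requests that need exactly the same number of top-level values end up with different shard-level request sizes, purely because of how the total is split between {{offset}} and {{limit}}.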
I propose:
At a minimum, for logical consistency, the default distributed facet
overrequest should be converted to a function/heuristic based on
{{offset+limit}}.
Having thought through this somewhat, I suggest that the overrequest function
should be a monotonically non-decreasing function (to avoid having an arbitrary
inflection point where requesting _more_ values disables overrequest and
results in _fewer_ values being requested in the initial phase). This function
could simply be a static boost (e.g., {{f\(x)=x+10}}), or could be a function
that decays to the identity function for higher numbers of requested values
where overrequest becomes unnecessary (e.g., {{f\(x)=x+(90/(7+x))}}).
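As a rough illustration of the two candidate shapes (the constants are just the example values above, not tuned recommendations), note how the additive boost behaves as the number of requested values x grows:

```java
public class OverrequestFunctions {
    // Static boost: always request 10 extra values, regardless of x.
    static double staticBoost(double x) {
        return x + 10;
    }

    // Decaying boost: the extra amount 90/(7+x) shrinks toward zero as x
    // grows, so f(x) approaches the identity function for large requests,
    // where overrequest becomes unnecessary.
    static double decayingBoost(double x) {
        return x + 90.0 / (7 + x);
    }

    public static void main(String[] args) {
        for (double x : new double[] {0, 10, 100}) {
            System.out.printf("x=%.0f  static=%.1f  decaying=%.2f%n",
                    x, staticBoost(x), decayingBoost(x));
        }
    }
}
```

With the decaying variant, a request for 0-10 values still gets a boost of roughly 9-13 extra values, while a request for 100 values gets less than 1 extra -- and neither shape has a point where asking for more values causes fewer to be requested in the initial phase.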
> Improve default distributed facet overrequest function/heuristic
> ----------------------------------------------------------------
>
> Key: SOLR-15760
> URL: https://issues.apache.org/jira/browse/SOLR-15760
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Components: Facet Module
> Affects Versions: main (9.0)
> Reporter: Michael Gibney
> Priority: Minor
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]