[
https://issues.apache.org/jira/browse/SOLR-15836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Gibney reassigned SOLR-15836:
-------------------------------------
Assignee: Michael Gibney
> Address counterintuitive behavior of JSON "terms" subfacet refinement
> ---------------------------------------------------------------------
>
> Key: SOLR-15836
> URL: https://issues.apache.org/jira/browse/SOLR-15836
> Project: Solr
> Issue Type: Improvement
> Components: Facet Module
> Affects Versions: main (9.0), 8.11
> Reporter: Michael Gibney
> Assignee: Michael Gibney
> Priority: Major
> Time Spent: 20m
> Remaining Estimate: 0h
>
> In distributed faceting, uneven distribution of terms across different shards
> can artificially include or exclude terms (this discussion will focus on JSON
> Facet "terms" faceting).
> This is inevitable, but can be mitigated via the {{overrequest}} and
> {{overrefine}} parameters, which respectively cast a "wider net" for "phase#1"
> (determining the set of "terms of interest") and "phase#2" (cross-checking
> "terms of interest" against shards that did not initially report them).
> It is possible to devise artificial situations that push the limit of what
> {{overrefine}} is capable of mitigating, resulting in counterintuitive
> behavior. But despite such edge cases, in general it is relatively
> straightforward to reason about how the {{simple}} JSON Facet refinement
> method works for "flat" (i.e., non-hierarchical) terms facets.
> This issue discusses some ways in which subfacets (hierarchical or nested
> facets) can more readily behave counterintuitively in practical usage, and
> possible ways to address/mitigate such behavior.
> ---------------------
> AFAICT, the {{simple}} (default, currently the only) refinement method has
> two defining requirements:
> # there is at most _one_ refinement request issued to each shard, and
> # any buckets returned are guaranteed to have accurate counts (or perhaps
> more generally, stats?) reflecting contributions from all shards. (This makes
> [no guarantees|https://issues.apache.org/jira/browse/SOLR-11159?focusedCommentId=16103386#comment-16103386]
> about buckets _not_ returned that would in principle be eligible to be
> returned.)
>
> The simplest counterintuitive case is when refinement of higher-level facets
> uncovers more subfacets on shards that have no opportunity to influence
> results/refinement of the child facet. I'm pretty sure it's this situation
> that's described in
> [this comment|https://github.com/apache/solr/blob/0287458f836e3b7ea4b2401538b29f3d2e9b6cf4/solr/core/src/test/org/apache/solr/search/facet/TestJsonFacetRefinement.java#L992-L994]
> (by [~hossman]?):
> {code:java}
> // - or at the very least, if the purpose of "_l" is to give other buckets a chance to "bubble up"
> // in phase#2, then shouldn't a "_l" refinement requests still include the buckets choosen in
> // phase#1, and request that the shard fill them in in addition to returning its own top buckets?
> {code}
> The proposal in the above linked comment would work iff the "own top buckets"
> returned in phase#2 did not introduce any new/unseen values (and note, the
> only case in which returning "own top buckets" would be significant _would_
> be the case in which it would introduce new/unseen values). If new values
> _were_ returned in phase#2, the only way to ensure that requirement 2 is
> respected would be to violate requirement 1 (i.e., by issuing _another_
> refinement request to determine whether any other shards have anything to
> contribute to the previously unseen value).
> This counterintuitive behavior can't exactly be called a "bug", because IIUC
> the intuitive behavior is fundamentally incompatible with the current
> default/only {{simple}} refinement method.
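Editorial illustration (not part of the issue description): the two-phase behavior described above for a "flat" terms facet can be modeled with a small in-memory simulation. This is only a sketch under simplifying assumptions, not Solr's implementation: each shard is a plain map of term to count, and the names {{SimpleRefinementSketch}}, {{topTerms}} and {{facet}} are hypothetical. It issues at most one refinement lookup per shard, and every bucket it returns has a count that reflects all shards, but it makes no promise about buckets it never considered.

```java
import java.util.*;
import java.util.stream.*;

public class SimpleRefinementSketch {

    /** Top n entries by count descending, ties broken by term ascending. */
    static Map<String, Long> topTerms(Map<String, Long> counts, int n) {
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()))
            .limit(n)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                (a, b) -> a, LinkedHashMap::new));
    }

    /** Simplified model of a two-phase distributed terms facet (assumption, not Solr code). */
    static Map<String, Long> facet(List<Map<String, Long>> shards,
                                   int limit, int overrequest, int overrefine) {
        // Phase 1: each shard reports only its own top (limit + overrequest) terms.
        List<Map<String, Long>> phase1 = new ArrayList<>();
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> shard : shards) {
            Map<String, Long> top = topTerms(shard, limit + overrequest);
            phase1.add(top);
            top.forEach((term, count) -> merged.merge(term, count, Long::sum));
        }

        // "Terms of interest": top (limit + overrefine) of the merged partial counts.
        Set<String> candidates = topTerms(merged, limit + overrefine).keySet();

        // Phase 2: one refinement pass per shard, asking only for candidates
        // that the shard did not report in phase 1.
        Map<String, Long> refined = new HashMap<>();
        for (String term : candidates) {
            refined.put(term, merged.get(term));
        }
        for (int i = 0; i < shards.size(); i++) {
            for (String term : candidates) {
                if (!phase1.get(i).containsKey(term)) {
                    refined.merge(term, shards.get(i).getOrDefault(term, 0L), Long::sum);
                }
            }
        }
        return topTerms(refined, limit);
    }

    public static void main(String[] args) {
        List<Map<String, Long>> shards = List.of(
            Map.of("x", 10L, "y", 9L),
            Map.of("z", 10L, "y", 9L));
        // "y" has the highest global count (18), but is second on both shards.
        System.out.println(facet(shards, 1, 0, 0)); // prints {x=10}
        System.out.println(facet(shards, 1, 1, 0)); // prints {y=18}
    }
}
```

With no overrequest, the returned bucket {{x}} has an accurate count, yet {{y}} is silently excluded; casting a wider net in phase 1 recovers it. The subfacet case discussed in the issue is harder because terms first seen during refinement would need a further round of requests to verify.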