[
https://issues.apache.org/jira/browse/SOLR-11733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282264#comment-16282264
]
Hoss Man commented on SOLR-11733:
---------------------------------
bq. I mentioned in SOLR-11729 the refinement algorithm being different (and for
a single-level facet field, simpler).
FWIW, here's yonik's comment from SOLR-11729 which seems to specifically be on
point for this issue (emphasis mine)...
bq. It seems like there are many logical ways to refine results - I originally
thought about using refine:simple because I imagined we would have other
implementations in the future. Anyway, this one is the simplest one to think
about and implement: *the top buckets to return for all facets are determined
in the first phase.* The second phase only gets contributions from other shards
for those buckets.
bq. i.e. simple refinement doesn't change the buckets you get back.
Ah ... ok. I didn't realize the refinement approach in {{json.facet}} wasn't
as sophisticated as {{facet.field}}.
To summarize again (in my own words to ensure I'm understanding you correctly):
# do a first pass, requesting "#limit + #overrequest" buckets from each shard
#* use the accumulated results of the first pass to determine the "top #limit
buckets"
# do a second pass, in which we back-fill the "top #limit buckets" with data
from any shards that have not yet contributed.
In which case, in my example above, the reason {{yyy}} isn't refined, even
though it has the same "first pass" total as {{x1}}, is because during the
first pass {{x1}} sorts higher (due to a secondary tie breaker sort on the
terms) pushing {{yyy}} out of the "top 6". (likewise {{x2}} and {{tail}} are
never considered because they were never part of the "top 6" even w/o a tie
breaker sort)
Do I have that correct?
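If I do, then (purely to check my own understanding) the two-phase flow would look something like the sketch below. This is hypothetical pseudocode, *not* the actual Solr implementation; all shard/bucket data and names are invented:

```python
def simple_refine(shards, limit, overrequest):
    """Sketch of refine:simple as I understand it.

    shards: {shard_name: {term: count}} -- invented toy data model.
    """
    merged = {}
    seen_by = {}
    # Phase 1: each shard returns its top (limit + overrequest) buckets,
    # sorted by count desc with the term itself as the tie-breaker.
    for name, counts in shards.items():
        top = sorted(counts.items(),
                     key=lambda kv: (-kv[1], kv[0]))[:limit + overrequest]
        for term, count in top:
            merged[term] = merged.get(term, 0) + count
            seen_by.setdefault(term, set()).add(name)
    # The bucket set is fixed here -- refinement won't change which
    # buckets come back, only their counts.
    winners = sorted(merged, key=lambda t: (-merged[t], t))[:limit]
    # Phase 2: back-fill each winning bucket from shards that didn't report it.
    for term in winners:
        for name, counts in shards.items():
            if name not in seen_by[term] and term in counts:
                merged[term] += counts[term]
    return sorted(((t, merged[t]) for t in winners),
                  key=lambda tc: (-tc[1], tc[0]))
```

Run against toy data with a count tie, this shows the effect I described: the bucket that loses the term-order tie-breaker is pushed out of the winners in phase 1 and is never refined.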
----
The bottom line: even if I don't fully grasp the current refinement mechanism
you've described, you're saying that the behavior I described with the above
sample documents is *not* a bug: it's the intended/expected behavior of
{{refine:true}} (aka {{refine:simple}}).
If so, I'll edit this JIRA into an "Improvement" & update the
summary/description to clarify how {{facet.pivot}} refinement differs from
{{json.facet}} + {{refine:simple}}, and leave it open for future improvement.
----
----
As far as discussion on potential improvements....
bq. From a correctness POV, smarter faceting is equivalent to increasing the
overrequest amount... we still can't make guarantees.
Hmmm... I'm not sure that I agree with that assessment. I guess
"mathematically" speaking it's true that, compared to a "smarter" refinement
method, this "simple" refine method can produce equally "correct" top terms
solely by increasing the overrequest amount -- but that's like saying we don't
even need any refinement method at all as long as we specify an infinite amount
of overrequest.
With the refinement approach used by {{facet.field}} (and {{facet.pivot}}) we
*can* make guarantees about the correctness of the top terms -- regardless of
if/how-much overrequesting is used -- _for any term that is in the "top
buckets" of at least one shard_.
IIUC the current {{json.facet}} refinement method can't make _any_ similar
guarantees at all, regardless of what (finite) overrequest value is specified
... but {{facet.field}} certainly can:
In {{facet.field}} today, if:
* A term is in the "top buckets" (limit + overrequest) returned by at least one
shard
* And the sort value (ie: count) returned by that shard (along with the lowest
sort-value/count returned by all other shards) indicates that the term _might_
be competitive relative to the other terms returned by other shards
...then that term is refined. That's a guarantee we can make.
Meaning that even if you have shards with widely differing term stats (ie: time
partitioned shards, or docs co-located due to multi-level compositeId, or block
join, etc..) we can/will refine the top terms from each shard.
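A rough sketch of that "might be competitive" test as I understand it (hypothetical names and data model, not the actual FacetComponent code):

```python
def needs_refinement(term, shard_tops, cutoff):
    """Would `term` be refined under facet.field-style logic?

    shard_tops: {shard_name: [(term, count), ...]} -- each shard's returned
                buckets, sorted by count desc (invented toy model).
    cutoff:     the current Nth-best merged count the term must beat.
    """
    best_case = 0
    for counts in shard_tops.values():
        known = dict(counts)
        if term in known:
            # Exact contribution from a shard that returned the term.
            best_case += known[term]
        elif counts:
            # A shard that did NOT return the term might still have it,
            # with at most that shard's lowest returned count.
            best_case += counts[-1][1]
    return best_case >= cutoff
```

ie: the decision is based on a *best case* total -- known counts plus each missing shard's lowest returned count -- so a term that appears on only one shard can still qualify for refinement.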
In {{facet.field}} the overrequest helps to:
* increase the scope of how deep we look to find the "top (candidate) terms"
from each shard
* decrease the amount of data we have to request when refining
...but the *distribution* of terms across shards has very little (none? ... not
certain) impact on the "correctness" of the "top N" in the aggregate. Even if
the first pass "top terms" from each shard are 100% unique, the *relative*
"bottom" counts from each shard are considered before assuming that the "higher"
counts should win -- meaning that if the shards have very different sizes, "top
terms" from the smaller shards still have a chance of being considered as an
"aggregated top term" as long as the "bottom count" from the (larger) shards is
high enough to indicate that those (missing) terms might still be competitive.
But in the {{json.facet}} approach to refinement, IIUC: A term returned by only
one shard won't be considered unless the count from _just that one shard_ is
high enough to help it dominate over the *cumulative* counts from each of the
top terms of the other shards.
Which seems not only to make the amount of overrequesting _much_ more
important to consider when requesting refinement, but also to require
considering the comparative *sizes* of the shards, and the potential term
distribution variances between them.
Or to put it another way...
*TL;DR: IIUC, the amount of overrequest is _much_ more important to consider
when requesting refinement on {{json.facet}} than it has ever been with
{{facet.field}}, but when picking an overrequest amount for {{json.facet}},
people also need to consider the relative differences in _sizes_ of their
shards, and the potential term distribution variances that may exist between
them.*
(correct?)
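To illustrate with some invented numbers (assuming my understanding above is right):

```python
# Under refine:simple, a term returned by only one (small) shard competes
# against the *cumulative* first-pass counts of the other shards' top terms.
limit = 1

# What each shard returns in the first pass (its top `limit` buckets):
big_shard_top   = {"common": 100}   # "rare" also has 50 docs on the big
small_shard_top = {"rare": 60}      # shard, but fell outside its top list.

first_pass = {}
for shard_top in (big_shard_top, small_shard_top):
    for term, count in shard_top.items():
        first_pass[term] = first_pass.get(term, 0) + count

# The bucket set is fixed now: "rare" (60) loses to "common" (100) and is
# never refined, even though its true total (60 + 50 = 110) actually wins.
# A facet.field-style bound check would have refined it.
winners = sorted(first_pass, key=first_pass.get, reverse=True)[:limit]
```

The more lopsided the shard sizes (or term distributions), the bigger the overrequest needed to avoid this.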
----
bq. We could easily implement a mode for some field facets that does the "could
this possibly be in the top N" logic to consider more buckets in the first
phase... but only if it's not a sub-facet of another partial facet (a facet
with something like a limit). If we're sorting by something other than count
(like stddev for instance) then I guess we'd have to discard smart pruning and
just try to get all buckets we saw in the first phase.
You lost me there.... If the sort is on some criteria other than count (ex:
stddev), why can't we compute a hypothetical "best case" sort value for the
candidates based on the pre-aggregation values returned by the "bottom" of the
other shards (ex: the sum, sumsq, and num_values already needed from each shard
for the aggregated stddev) in combination with the values from the one shard
that *does* have that term?
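ie: the partials each shard already ships for an aggregated stddev combine exactly, so a hypothetical best-case bound seems computable. A sketch of the exact combination (invented helper, not actual Solr code):

```python
import math

def merged_stddev(partials):
    """Combine per-shard stddev partials exactly.

    partials: [(count, sum, sumsq), ...] -- one tuple per shard; these are
    the same moments each shard must already return for an aggregated stddev.
    """
    n  = sum(p[0] for p in partials)
    s  = sum(p[1] for p in partials)
    sq = sum(p[2] for p in partials)
    # Population variance from the combined moments.
    return math.sqrt(sq / n - (s / n) ** 2)
```

Given that, plugging in "best case" (bounding) moments for the shards that did *not* return the term would seem to give a hypothetical best-case sort value, analogous to the count-based bound.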
bq. If a partial facet is a sub-facet of another partial-facet, the logic of
what one can exclude seems to get harder, ...
You _completely_ lost me there ... I *think* maybe you're alluding to the need
for multi-stage refinement depending on how deep the nested facets go? Which,
FWIW, is exactly what {{facet.pivot}} does today.
> json.facet refinement fails to bubble up some long tail (overrequested) terms?
> ------------------------------------------------------------------------------
>
> Key: SOLR-11733
> URL: https://issues.apache.org/jira/browse/SOLR-11733
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: Facet Module
> Reporter: Hoss Man
>
> Something wonky is happening with {{json.facet}} refinement.
> "Long Tail" terms that may not be in the "top n" on every shard, but are in
> the "top n + overrequest" for at least 1 shard aren't getting refined and
> included in the aggregated response in some cases.
> I don't understand the code enough to explain this, but I have some steps to
> reproduce that I'll post in a comment shortly.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)