[
https://issues.apache.org/jira/browse/SOLR-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519008#comment-17519008
]
Michael Gibney commented on SOLR-16144:
---------------------------------------
[~hossman], I'm curious to know your perspective on this, if you have a chance
to take a look? (Adding a test would be relatively straightforward, though even
a minimal/sanity-check test would require an index with at least 100,001 docs,
I think).
> Don't internally round [foreground|background]_popularity values in
> RelatednessAgg
> ----------------------------------------------------------------------------------
>
> Key: SOLR-16144
> URL: https://issues.apache.org/jira/browse/SOLR-16144
> Project: Solr
> Issue Type: Improvement
> Security Level: Public (Default Security Level. Issues are Public)
> Components: Facet Module
> Affects Versions: main (10.0)
> Reporter: Michael Gibney
> Priority: Trivial
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The "relatedness" facet function supports the concept of
> {{foreground_popularity}} and {{background_popularity}} -- i.e., the
> cardinality of the intersection of bucket domain with the foreground and
> background sets (respectively), each normalized with respect to background
> set cardinality.
> These values appear to serve two purposes:
> # To provide clients with context of computed relatedness values
> # To preemptively (optionally) screen out "noise" from low-frequency terms
> via the {{min_popularity}} function parameter.
> For both purposes, popularity values are currently rounded to 5 digits.
> This issue proposes that, although rounding to 5 digits makes sense for the
> _first_ case (providing context to clients), this arbitrary truncation does
> not make sense as currently implemented for internally evaluating threshold
> popularity values for bucket inclusion.
> Consider the case of a high-cardinality field with a relatively large
> background set and a selective foreground set. For {{|background_set| =
> 2,000,000}} and a foreground set of cardinality 9, even a bucket with a
> domain that exactly matches the foreground set would be screened out, for
> _any_ explicit setting of {{min_popularity}}.
> This behavior is due to where the rounding takes place (internally, upon
> initial {{computeDerivedValues()}}). It is further problematic that
> {{RelatednessAgg}} will currently accept {{min_popularity < 0.00001}}, which
> would be guaranteed to exclude _all_ buckets.
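
The arithmetic behind the 2,000,000 / 9 example can be checked with a small standalone sketch. It is not the RelatednessAgg code: the class name, the passes() helper, and the exact rounding expression (five decimal places via Math.round) are assumptions made for illustration, as is the rule that a bucket is kept when its popularity is at least min_popularity.

```java
public class PopularityRounding {

    // Assumed rounding rule: five decimal places, half-up.
    // Hypothetical helper; the real implementation may differ in detail.
    static double roundTo5Digits(double val) {
        return Math.round(val * 1e5) / 1e5;
    }

    // Hypothetical threshold check: keep a bucket when its popularity
    // is at least minPopularity.
    static boolean passes(double popularity, double minPopularity) {
        return popularity >= minPopularity;
    }

    public static void main(String[] args) {
        long backgroundSize = 2_000_000L; // |background_set| from the issue text
        long foregroundSize = 9L;         // bucket domain exactly matches the foreground set

        // Popularity is normalized by background set cardinality.
        double raw = (double) foregroundSize / backgroundSize;
        double rounded = roundTo5Digits(raw);

        double minPopularity = 0.000001;  // an explicit, very permissive threshold

        System.out.println("raw popularity     = " + raw);
        System.out.println("rounded popularity = " + rounded);
        // The rounded value collapses to 0.0, so it fails any positive threshold.
        System.out.println("passes with rounded value: " + passes(rounded, minPopularity));
        // The unrounded value clears the same threshold.
        System.out.println("passes with raw value:     " + passes(raw, minPopularity));
    }
}
```

Under these assumptions the raw popularity is 9 / 2,000,000 = 0.0000045, which rounds to 0.0 at five digits, so the bucket is rejected for any positive min_popularity; comparing the unrounded value against the threshold keeps it.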