[
https://issues.apache.org/jira/browse/SOLR-9395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412599#comment-15412599
]
Hoss Man commented on SOLR-9395:
--------------------------------
misc thoughts...
bq. Now one question might be, why not do this with a filter query? In many
cases you don't necessarily want to filter these documents from the main search
results. You just want to eliminate outliers from a specific stats calculation.
A different question: Rather than adding a customization to the stats params,
would it be more generally useful to implement these as new ValueSource
wrappers (along the lines of the existing "map" and "if" functions) and use
them with the existing support for computing stats over arbitrary functions?
Something like...
{noformat}
stats.field={!func}upperBound(lowerBound(age,18),60)
{noformat}
where {{lowerBound}} and {{upperBound}} are implemented such that they return
the value they are wrapping, but their {{exists()}} method only returns
{{true}} if the constraint is met.
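To make the intended {{exists()}} semantics concrete, here is a rough Python model of the idea. It is not Solr code: a "source" is just a function from a document to a value, with {{None}} standing in for "{{exists()}} is false", and the inclusive bounds are an assumption taken from the inclusive range in the issue description.

```python
from typing import Callable, Optional

# Hypothetical model: a source maps a document to its value, or to None
# when exists() would be false for that document.
Source = Callable[[dict], Optional[float]]


def field(name: str) -> Source:
    """Value of a field, or None if the document has no such field."""
    return lambda doc: doc.get(name)


def lower_bound(src: Source, bound: float) -> Source:
    """Pass the wrapped value through unchanged, but only 'exist' at or above bound."""
    def wrapped(doc: dict) -> Optional[float]:
        value = src(doc)
        return value if value is not None and value >= bound else None
    return wrapped


def upper_bound(src: Source, bound: float) -> Source:
    """Pass the wrapped value through unchanged, but only 'exist' at or below bound."""
    def wrapped(doc: dict) -> Optional[float]:
        value = src(doc)
        return value if value is not None and value <= bound else None
    return wrapped


docs = [{"age": 16}, {"age": 18}, {"age": 45}, {"age": 60}, {"age": 72}, {}]
bounded = upper_bound(lower_bound(field("age"), 18), 60)
values = [bounded(doc) for doc in docs]
```

In this model the 16 and 72 year olds end up indistinguishable from the document that has no age at all, which is the behavior the wrappers are meant to have.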
Or maybe generalize the idea to add a {{filter}} function that can wrap another
function, but whose {{exists()}} method only returns {{true}} if one or more
queries all match the document...
{noformat}
func.filter.q=age:[18 TO 60]
stats.field={!func}filter(age, $func.filter.q)
{noformat}
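A rough Python model of that {{filter}} idea, as a sketch only: queries are modeled as plain predicates over a document (a stand-in for real Solr queries), and {{None}} stands in for "{{exists()}} is false". The helper names are hypothetical.

```python
from typing import Callable, Optional

Source = Callable[[dict], Optional[float]]   # None models exists() == false
Query = Callable[[dict], bool]               # stand-in for a real Solr query


def field(name: str) -> Source:
    return lambda doc: doc.get(name)


def filter_source(src: Source, *queries: Query) -> Source:
    """Wrap src so its value only 'exists' when every query matches the document."""
    def wrapped(doc: dict) -> Optional[float]:
        if all(query(doc) for query in queries):
            return src(doc)
        return None
    return wrapped


def age_18_to_60(doc: dict) -> bool:
    """Models the query age:[18 TO 60] (inclusive on both ends)."""
    age = doc.get("age")
    return age is not None and 18 <= age <= 60


docs = [{"age": 16}, {"age": 18}, {"age": 45}, {"age": 60}, {"age": 72}, {}]
filtered = filter_source(field("age"), age_18_to_60)
values = [filtered(doc) for doc in docs]
```

Because the queries look at the whole document rather than the wrapped value, the same wrapper could constrain stats on one field by conditions on another.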
For that matter -- if we had a {{novalue()}} function whose {{exists()}}
method always returns {{false}}, regardless of the document, we could
also just do...
{noformat}
stats.field={!func}if(and(gte(age,18),lte(age,60)),age,novalue())
{noformat}
...those function based approaches all seem like they would have additional
value above and beyond just constraining stats that might make them a better
general API for this type of problem.
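As a sketch of how the {{if(...)}} / {{novalue()}} composition would behave, here is a small Python model. Nothing here is Solr code: {{None}} stands in for "{{exists()}} is false", the helper names mirror the function names above, and treating a missing field as failing every comparison is an assumption of the model.

```python
from typing import Callable, Optional

Source = Callable[[dict], Optional[float]]   # None models exists() == false
Cond = Callable[[dict], bool]


def field(name: str) -> Source:
    return lambda doc: doc.get(name)


def novalue() -> Source:
    """A source that never exists, whatever the document."""
    return lambda doc: None


def gte(src: Source, n: float) -> Cond:
    def test(doc: dict) -> bool:
        value = src(doc)
        return value is not None and value >= n
    return test


def lte(src: Source, n: float) -> Cond:
    def test(doc: dict) -> bool:
        value = src(doc)
        return value is not None and value <= n
    return test


def and_(*conds: Cond) -> Cond:
    return lambda doc: all(cond(doc) for cond in conds)


def if_(cond: Cond, then: Source, otherwise: Source) -> Source:
    return lambda doc: then(doc) if cond(doc) else otherwise(doc)


age = field("age")
expr = if_(and_(gte(age, 18), lte(age, 60)), age, novalue())
docs = [{"age": 16}, {"age": 18}, {"age": 45}, {"age": 60}, {"age": 72}, {}]
values = [expr(doc) for doc in docs]
```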
----
However...
From what i've seen skimming the patch, the new floor/ceil params you're
proposing wouldn't work quite the same way as what i'm suggesting, because it
looks like what you've implemented doesn't actually cause values out of range
to be "ignored", they are actually explicitly counted in a new {{outOfBounds}}
stat result value -- if we went the function route that wouldn't be there,
these "out of bounds" docs would just be counted as {{missing}} the same as if
they didn't have any value in the specified field.
i'm not sure how i feel about the {{outOfBounds}} stat value ... part of me
thinks it's handy, but part of me thinks it's kind of niche, and people who
care about the distinction between that and {{missing}} might just as easily
care about the distinction between "below the lower bound" and "above the upper
bound" which I think would be just as easy with a function based approach as
with a stats specific based approach...
{noformat}
stats.field={!func key=18_to_60 mean=true missing=true}if(and(gte(age,18),lte(age,60)),age,novalue())
stats.field={!func key=under_18_or_unknown count=true}if(lt(age,18),age,novalue())
stats.field={!func key=over_60_or_unknown count=true}if(gt(age,60),age,novalue())
{noformat}
In other words: if we think people will actually care about "out of bounds",
then it seems like a strong argument to go the function route, so they can get
specific details of where/how/why values are out of bounds.
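To illustrate the kind of breakdown that falls out of the function route, here is a small Python model of three keyed stats over the same documents. It is only a sketch: the {{stats}} and {{age_if}} helpers, the shortened keys, and the sample ages are made up for illustration, and {{None}} stands in for a value that does not exist.

```python
from statistics import mean
from typing import Callable, Optional

Source = Callable[[dict], Optional[float]]   # None models exists() == false


def age_if(predicate: Callable[[float], bool]) -> Source:
    """Models if(<predicate on age>, age, novalue())."""
    def source(doc: dict) -> Optional[float]:
        age = doc.get("age")
        return age if age is not None and predicate(age) else None
    return source


def stats(docs: list, source: Source) -> dict:
    """Count and mean over existing values; everything else is 'missing'."""
    values = [v for v in map(source, docs) if v is not None]
    return {
        "count": len(values),
        "missing": len(docs) - len(values),
        "mean": mean(values) if values else None,
    }


docs = [{"age": 16}, {"age": 18}, {"age": 45}, {"age": 60}, {"age": 72}, {}]
results = {
    "18_to_60": stats(docs, age_if(lambda a: 18 <= a <= 60)),
    "under_18": stats(docs, age_if(lambda a: a < 18)),
    "over_60": stats(docs, age_if(lambda a: a > 60)),
}
```

Here the in-range stat reports 3 values and 3 missing, while the two extra stats show that one of those "missing" docs is below the lower bound and one is above the upper bound, which a single {{outOfBounds}} count could not distinguish.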
----
Either way, one large concern I have is over the proposed API using the terms
{{floor}} and {{ceil}} .... we should definitely _not_ use those terms for this
purpose, as in the context of math/stats/numeric values they are largely
(universally?) interpreted to refer to mapping a "real" number to the nearest
"integer" value up/down on the "real" number scale ... and that is most
certainly not at all what's happening here.
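A quick Python illustration of the difference, using made-up sample values: the conventional floor/ceil operations round each value to a neighboring integer, whereas the proposed params exclude values outside a range and leave the rest untouched.

```python
import math

ages = [17.6, 18.0, 59.2, 60.4]

# Conventional meaning: map each real number to the nearest integer down/up.
floored = [math.floor(a) for a in ages]
ceiled = [math.ceil(a) for a in ages]

# What the proposed params do instead: keep only values within an inclusive range.
in_range = [a for a in ages if 18 <= a <= 60]
```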
> Add ceil/floor bounding to stats calculations
> ---------------------------------------------
>
> Key: SOLR-9395
> URL: https://issues.apache.org/jira/browse/SOLR-9395
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Affects Versions: master (7.0)
> Reporter: Doug Turnbull
> Fix For: master (7.0)
>
>
> In the pull request to be attached we add optional ceil and floor parameters
> to a field being computed via the stats component. This bounds the stats
> calculations to ceil to floor inclusive.
> For example, let's say you're searching over all the employees.
> stats=true&stats.field=employee_age
> But you want to focus on employees aged 18-60 for whatever reason. You can
> reissue this query as
> stats=true&stats.field={!floor=18 ceil=60}employee_age
> This limits the resulting stats calculations to 18-60 inclusive. This
> functionality also works on date fields (see test in PR).
> Now one question might be, why not do this with a filter query? In many cases
> you don't necessarily want to filter these documents from the main search
> results. You just want to eliminate outliers from a specific stats
> calculation. For example, you search your employee database for "clerks." You
> still want to see all the clerks, even little 16 year old Timmy. But for this
> particular calculation you just want to focus on folks of traditional working
> age for whatever reason.
> Some notes
> - floor/ceil are only supported as local params.
> - works for date and numeric values
> - date math works!