[jira] [Commented] (SOLR-11725) json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula

Yonik Seeley (JIRA) Thu, 28 Dec 2017 21:14:28 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305994#comment-16305994
 ]


Yonik Seeley commented on SOLR-11725:
-------------------------------------

{quote}
> ...In general we've been moving toward omitting undefined functions. Stats 
> like min() and max() already do this.
Whoa... really? ... that seems like it would make the client parsing really 
hard...
{quote}

Trying to remember.  I *think* it may have just worked out that way originally 
when null is returned as the value from SlotAcc.getValue().
And I may have also conflated "empty bucket" with "stat over no values".  I'm 
not sure if client parsing is really much harder since a map interface of 
bucket.get("mystat") would return null in both cases.
On the other hand, I can see how it could be confusing to request a stat and 
not see it at all in the response.  Overall I guess I'm leaning toward 
returning "mystat":null for a non-empty bucket where mystat has no value / 
undefined value.
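
The client-side point above can be illustrated with a minimal sketch. It assumes the client exposes each bucket as a plain java.util.Map; the class name, method names, and bucket contents are hypothetical and are not taken from Solr or SolrJ. It only shows that get() cannot tell an omitted stat from an explicit null, while containsKey() can.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: a plain Map stands in for a parsed facet bucket.
class NullStatDemo {
    /** Bucket where the requested stat was omitted from the response. */
    static Map<String, Object> bucketWithOmittedStat() {
        Map<String, Object> bucket = new HashMap<>();
        bucket.put("count", 3L);
        return bucket;
    }

    /** Bucket where the requested stat is present with an explicit null value. */
    static Map<String, Object> bucketWithNullStat() {
        Map<String, Object> bucket = new HashMap<>();
        bucket.put("count", 3L);
        bucket.put("mystat", null); // HashMap permits null values
        return bucket;
    }

    public static void main(String[] args) {
        Map<String, Object> omitted = bucketWithOmittedStat();
        Map<String, Object> explicit = bucketWithNullStat();

        // get() returns null in both cases, so simple client code behaves the same.
        System.out.println(omitted.get("mystat"));
        System.out.println(explicit.get("mystat"));

        // Only containsKey() reveals whether the stat appeared in the response at all.
        System.out.println(omitted.containsKey("mystat"));
        System.out.println(explicit.containsKey("mystat"));
    }
}
```

A client that only ever calls get() is unaffected by the choice; one that iterates over the keys of the bucket would see the difference.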

bq. For a singleton set, the stddev() should absolutely be "0"

Standard deviation of a population of size 1, yes. But this issue was about 
switching to standard deviation of samples, and that is undefined (or infinite) 
for a single sample.
Python throws an exception: 
https://docs.python.org/3/library/statistics.html#statistics.stdev
Google Sheets will return a div-by-0 error: 
https://support.google.com/docs/answer/3094054?hl=en
Excel also gives a div-by-0 error with a single value.  I can't find anything 
using the "N-1" variant that uses 0 for a single sample.
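
To make the single-sample case concrete, here is a small sketch that evaluates the two expressions quoted in the issue description for one value. The class and method names are hypothetical, and this is not the actual StddevAgg or StatsValuesFactory code; only the two Math.sqrt expressions are copied from the issue. In plain Java double arithmetic the "N-1" form works out to 0.0 / 0.0, which is NaN rather than an exception.

```java
// Hypothetical illustration only; the two formulas are the ones quoted in the issue.
class StddevDemo {
    /** Uncorrected form: the expression the issue attributes to StddevAgg.java. */
    static double uncorrected(double sum, double sumSq, long count) {
        return Math.sqrt((sumSq / count) - Math.pow(sum / count, 2));
    }

    /** Corrected ("N-1") form: the expression the issue attributes to StatsValuesFactory.java. */
    static double corrected(double sum, double sumSq, long count) {
        return Math.sqrt(((count * sumSq) - (sum * sum)) / (count * (count - 1.0D)));
    }

    public static void main(String[] args) {
        // A single sample with value 5.0: sum = 5, sum of squares = 25, count = 1.
        System.out.println(uncorrected(5.0, 25.0, 1)); // prints 0.0
        System.out.println(corrected(5.0, 25.0, 1));   // prints NaN (0.0 / 0.0)
    }
}
```

How an undefined result like this should surface in a response (omitted, null, or something else) is the open question in this thread; the sketch only shows what the raw arithmetic produces.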



> json.facet's stddev() function should be changed to use the "Corrected sample 
> stddev" formula
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11725
>                 URL: https://issues.apache.org/jira/browse/SOLR-11725
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>         Attachments: SOLR-11725.patch
>
>
> While working on some equivalence tests/demonstrations for 
> {{facet.pivot+stats.field}} vs {{json.facet}} I noticed that the {{stddev}} 
> calculations done between the two code paths can be measurably different, and 
> realized this is due to them using very different code...
> * {{json.facet=foo:stddev(foo)}}
> ** {{StddevAgg.java}}
> ** {{Math.sqrt((sumSq/count)-Math.pow(sum/count, 2))}}
> * {{stats.field=\{!stddev=true\}foo}}
> ** {{StatsValuesFactory.java}}
> ** {{Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 1.0D)))}}
> Since I'm not really a math guy, I consulted with a bunch of smart math/stat 
> nerds I know online to help me sanity check whether these equations (somehow) 
> reduced to each other (in which case the discrepancies I was seeing in my 
> results might have just been due to the order of intermediate operation 
> execution & floating point rounding differences).
> They confirmed that the two bits of code are _not_ equivalent to each other, 
> and explained that the code JSON Faceting is using is equivalent to the 
> "Uncorrected sample stddev" formula, while StatsComponent's code is 
> equivalent to the "Corrected sample stddev" formula...
> https://en.wikipedia.org/wiki/Standard_deviation#Uncorrected_sample_standard_deviation
> When I told them that stuff like this is why no one likes mathematicians and 
> pressed them to explain which one was the "most canonical" (or "most 
> generally applicable" or "best") definition of stddev, I was told that:
> # This is something statisticians frequently disagree on
> # Practically speaking the two calculations don't tend to differ 
> significantly when count is "very large"
> # _"Corrected sample stddev" is more appropriate when comparing two 
> distributions_
> Given that:
> * the primary usage of computing the stddev of a field/function against a 
> Solr result set (or against a sub-set of results defined by a facet 
> constraint) is probably to compare that distribution to a different Solr 
> result set (or to compare N sub-sets of results defined by N facet 
> constraints)
> * the size of the sets of documents (values) can be relatively small when 
> computing stats over facet constraint sub-sets
> ...it seems like {{StddevAgg.java}} should be updated to use the "Corrected 
> sample stddev" equation.
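
For reference, the relationship between the two expressions quoted in the issue description can be written out as below. The symbols N (count), x_i (the values), s_N (uncorrected result), and s (corrected result) are notation introduced here for illustration and do not appear in the issue.

```latex
\begin{align*}
s_N &= \sqrt{\frac{\sum x_i^2}{N} - \left(\frac{\sum x_i}{N}\right)^2}
  && \text{(uncorrected, as in StddevAgg.java)} \\
s &= \sqrt{\frac{N \sum x_i^2 - \left(\sum x_i\right)^2}{N (N - 1)}}
  && \text{(corrected, as in StatsValuesFactory.java)} \\
s^2 &= \frac{N^2 s_N^2}{N (N - 1)} = \frac{N}{N - 1}\, s_N^2
  \quad\Longrightarrow\quad s = s_N \sqrt{\frac{N}{N - 1}}
\end{align*}
```

The factor sqrt(N / (N - 1)) approaches 1 as N grows, which matches the remark that the two results barely differ when count is very large. At N = 1 the corrected form becomes 0/0 and is undefined.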



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
