Hoss Man created SOLR-11725:
-------------------------------

             Summary: json.facet's stddev() function should be changed to use 
the "Corrected sample stddev" formula
                 Key: SOLR-11725
                 URL: https://issues.apache.org/jira/browse/SOLR-11725
             Project: Solr
          Issue Type: Improvement
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Hoss Man



While working on some equivalence tests/demonstrations for 
{{facet.pivot+stats.field}} vs {{json.facet}} I noticed that the {{stddev}} 
calculations done between the two code paths can be measurably different, and 
realized this is due to them using very different code...

* {{json.facet=foo:stddev(foo)}}
** {{StddevAgg.java}}
** {{Math.sqrt((sumSq/count)-Math.pow(sum/count, 2))}}
* {{stats.field=\{!stddev=true\}foo}}
** {{StatsValuesFactory.java}}
** {{Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 1.0D)))}}
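
To make the difference concrete, here is a minimal sketch that evaluates both expressions quoted above on the same running totals. The class name, method names, and sample values are hypothetical and chosen only for illustration; the two formulas are copied from the snippets above, and this is not the actual Solr code.

```java
// Hypothetical helper class, not part of Solr: compares the two stddev formulas
// quoted in this issue on the same (sum, sumSq, count) accumulators.
class StddevComparison {

    // Expression quoted from StddevAgg.java (json.facet).
    public static double uncorrected(double sum, double sumSq, long count) {
        return Math.sqrt((sumSq / count) - Math.pow(sum / count, 2));
    }

    // Expression quoted from StatsValuesFactory.java (stats.field).
    public static double corrected(double sum, double sumSq, long count) {
        return Math.sqrt(((count * sumSq) - (sum * sum)) / (count * (count - 1.0D)));
    }

    public static void main(String[] args) {
        // Made-up sample values: sum = 40, sum of squares = 232, count = 8.
        double[] values = {2, 4, 4, 4, 5, 5, 7, 9};
        double sum = 0;
        double sumSq = 0;
        for (double v : values) {
            sum += v;
            sumSq += v * v;
        }
        System.out.println("json.facet style:  " + uncorrected(sum, sumSq, values.length));
        System.out.println("stats.field style: " + corrected(sum, sumSq, values.length));
    }
}
```

For these eight values the first expression evaluates to 2.0 and the second to sqrt(32/7), roughly 2.138, so the gap is far larger than floating point rounding would explain.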

Since I'm not really a math guy, I consulted a bunch of smart math/stat 
nerds I know online to help me sanity check whether these equations (somehow) 
reduced to each other (in which case the discrepancies I was seeing in my 
results might have just been due to the order of intermediate operation 
execution & floating point rounding differences).

They confirmed that the two bits of code are _not_ equivalent to each other, 
and explained that the code JSON Faceting is using is equivalent to the 
"Uncorrected sample stddev" formula, while StatsComponent's code is equivalent 
to the "Corrected sample stddev" formula...

https://en.wikipedia.org/wiki/Standard_deviation#Uncorrected_sample_standard_deviation

When I told them that stuff like this is why no one likes mathematicians and 
pressed them to explain which one was the "most canonical" (or "most generally 
applicable" or "best") definition of stddev, I was told that:

# This is something statisticians frequently disagree on
# Practically speaking, the results of the two calculations don't tend to differ 
significantly when count is "very large"
# _"Corrected sample stddev" is more appropriate when comparing two 
distributions_
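
The second point can be seen directly from the two code expressions quoted earlier. As a sketch, using my own notation (N for count, x_i for the individual values, s_N for the uncorrected result, s for the corrected result):

```latex
s_N = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^2 - \left(\frac{1}{N}\sum_{i=1}^{N} x_i\right)^{2}}
\qquad
s = \sqrt{\frac{N\sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^{2}}{N(N-1)}}
  = s_N \sqrt{\frac{N}{N-1}}
```

The two results differ only by the factor sqrt(N/(N-1)): about 1.414 for N=2, about 1.054 for N=10, and about 1.0005 for N=1000, which is why the difference fades for large counts but matters for small sets.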

Given that:

* the primary usage of computing the stddev of a field/function against a Solr 
result set (or against a sub-set of results defined by a facet constraint) is 
probably to compare that distribution to a different Solr result set (or to 
compare N sub-sets of results defined by N facet constraints)
* the size of the sets of documents (values) can be relatively small when 
computing stats over facet constraint sub-sets

...it seems like {{StddevAgg.java}} should be updated to use the "Corrected 
sample stddev" equation.




