[ https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281837#comment-16281837 ]

Yonik Seeley edited comment on SOLR-11725 at 12/7/17 2:03 PM:
--------------------------------------------------------------

+1 for changing... "N-1" is the more standard form.

bq. Attaching a trivial patch containing the change Hoss spelled out above.
Note that the accumulator needs to be changed as well for non-distributed 
stddev calculation.  The Merger is not used in that case.

This does bring up the question of what to do when N=1 (or N=0, for that matter).
The standard deviation of a population of N=1 is 0, but that of a sample of N=1 is 
undefined (the corrected formula's N-1 denominator goes to 0, giving 0/0).
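
To make the N=1 behavior concrete, here's a standalone sketch of the two formulas side by side (my own helper names, not the actual Solr code):

```java
public class StddevForms {
    // Uncorrected ("population") form, as StddevAgg computes it today:
    // sqrt(sumSq/N - (sum/N)^2)
    static double uncorrected(double[] v) {
        double sum = 0, sumSq = 0;
        for (double x : v) { sum += x; sumSq += x * x; }
        int n = v.length;
        return Math.sqrt(sumSq / n - Math.pow(sum / n, 2));
    }

    // Corrected ("sample") form, as StatsValuesFactory computes it:
    // sqrt((N*sumSq - sum^2) / (N*(N-1)))
    static double corrected(double[] v) {
        double sum = 0, sumSq = 0;
        for (double x : v) { sum += x; sumSq += x * x; }
        int n = v.length;
        return Math.sqrt((n * sumSq - sum * sum) / (n * (n - 1.0)));
    }

    public static void main(String[] args) {
        double[] one = {5.0};
        System.out.println(uncorrected(one)); // 0.0 -- population stddev of one value
        System.out.println(corrected(one));   // NaN -- 0/0: undefined for N=1
    }
}
```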

When N=0, the current code produces 0, but I don't think that's the best choice.
In general we've been moving toward omitting undefined functions.  Stats like 
min() and max() already do this.

TestJsonFacets has this:
{code}
    // stats at top level, matching documents, but no values in the field
    // NOTE: this represents the current state of what is returned, not the ultimate desired state.
    client.testJQ(params(p, "q", "id:3"
        , "json.facet", "{ sum1:'sum(${num_d})', sumsq1:'sumsq(${num_d})', avg1:'avg(${num_d})', min1:'min(${num_d})', max1:'max(${num_d})'" +
            ", numwhere:'unique(${where_s})', unique_num_i:'unique(${num_i})', unique_num_d:'unique(${num_d})', unique_date:'unique(${date})'" +
            ", where_hll:'hll(${where_s})', hll_num_i:'hll(${num_i})', hll_num_d:'hll(${num_d})', hll_date:'hll(${date})'" +
            ", med:'percentile(${num_d},50)', perc:'percentile(${num_d},0,50.0,100)', variance:'variance(${num_d})', stddev:'stddev(${num_d})' }"
        )
        , "facets=={count:1 " +
            ",sum1:0.0," +
            " sumsq1:0.0," +
            " avg1:0.0," +   // TODO: undesirable. omit?
            // " min1:'NaN'," +
            // " max1:'NaN'," +
            " numwhere:0," +
            " unique_num_i:0," +
            " unique_num_d:0," +
            " unique_date:0," +
            " where_hll:0," +
            " hll_num_i:0," +
            " hll_num_d:0," +
            " hll_date:0," +
            " variance:0.0," +
            " stddev:0.0" +
            " }"
    );
{code}

I'd be tempted to treat N=0 and N=1 as undefined, and omit them.  Note that we 
need to be careful to have the N=1 case still contribute to a distributed 
bucket, even if it's undefined locally!
In the distributed case, N=0 is normally handled generically for anything that 
doesn't produce a result (they are "missing"/null and are sorted after anything 
that has a value).  Things may work if we make getDouble() return 0 (for 
sorting), but then override getMergedResult() to return null when N <= 1.
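
Roughly what I have in mind, as a standalone sketch (illustrative names only, not the actual merger class; each shard contributes a partial (sum, sumSq, count) tuple):

```java
// Illustrative stand-in for the stddev merger -- not the real Solr class.
// Shards send partial (sum, sumSq, count) tuples; the merger combines them
// and only reports a value when the corrected formula is defined (N > 1).
class StddevMergerSketch {
    double sum, sumSq;
    long count;

    // fold in one shard's partial tuple
    void merge(double shardSum, double shardSumSq, long shardCount) {
        sum += shardSum;
        sumSq += shardSumSq;
        count += shardCount;
    }

    // used for bucket sorting: treat the undefined case as 0
    double getDouble() {
        if (count <= 1) return 0.0;
        return Math.sqrt((count * sumSq - sum * sum) / (count * (count - 1.0)));
    }

    // returned to the client: null => the stat is omitted from the bucket
    Object getMergedResult() {
        return (count <= 1) ? null : getDouble();
    }
}
```

Note that a shard with N=1 still sends its (sum, sumSq, 1) tuple, so it contributes to the merged bucket even though its local result alone would be omitted.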

Oh, and whatever treatment we give stddev(), we should presumably give to 
variance()?

When thinking about the sorting, it occurs to me that there is probably already a 
minor sorting bug: N=0 is treated the same as a true standard deviation of 0, but 
they shouldn't sort equivalently.
That said, we already have a local sorting bug with respect to 
sort-missing-last: SOLR-10618



> json.facet's stddev() function should be changed to use the "Corrected sample 
> stddev" formula
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11725
>                 URL: https://issues.apache.org/jira/browse/SOLR-11725
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>         Attachments: SOLR-11725.patch
>
>
> While working on some equivalence tests/demonstrations for 
> {{facet.pivot+stats.field}} vs {{json.facet}} I noticed that the {{stddev}} 
> calculations done between the two code paths can be measurably different, and 
> realized this is due to them using very different code...
> * {{json.facet=foo:stddev(foo)}}
> ** {{StddevAgg.java}}
> ** {{Math.sqrt((sumSq/count)-Math.pow(sum/count, 2))}}
> * {{stats.field=\{!stddev=true\}foo}}
> ** {{StatsValuesFactory.java}}
> ** {{Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 
> 1.0D)))}}
> Since I'm not really a math guy, I consulted with a bunch of smart math/stat 
> nerds I know online to help me sanity check whether these equations (somehow) 
> reduced to each other (in which case the discrepancies I was seeing in my 
> results might have just been due to the order of intermediate operation 
> execution & floating point rounding differences).
> They confirmed that the two bits of code are _not_ equivalent to each other, 
> and explained that the code JSON Faceting is using is equivalent to the 
> "Uncorrected sample stddev" formula, while StatsComponent's code is 
> equivalent to the "Corrected sample stddev" formula...
> https://en.wikipedia.org/wiki/Standard_deviation#Uncorrected_sample_standard_deviation
> When I told them that stuff like this is why no one likes mathematicians and 
> pressed them to explain which one was the "most canonical" (or "most 
> generally applicable" or "best") definition of stddev, I was told that:
> # This is something statisticians frequently disagree on
> # Practically speaking, the two calculations don't tend to differ 
> significantly when count is "very large"
> # _"Corrected sample stddev" is more appropriate when comparing two 
> distributions_
> Given that:
> * the primary usage of computing the stddev of a field/function against a 
> Solr result set (or against a sub-set of results defined by a facet 
> constraint) is probably to compare that distribution to a different Solr 
> result set (or to compare N sub-sets of results defined by N facet 
> constraints)
> * the size of the sets of documents (values) can be relatively small when 
> computing stats over facet constraint sub-sets
> ...it seems like {{StddevAgg.java}} should be updated to use the "Corrected 
> sample stddev" equation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
