[jira] [Commented] (SOLR-11725) json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula
[ https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16305994#comment-16305994 ]

Yonik Seeley commented on SOLR-11725:
-------------------------------------

{quote}
> ...In general we've been moving toward omitting undefined functions. Stats
> like min() and max() already do this.

Whoa... really? ... that seems like it would make the client parsing really hard...
{quote}

Trying to remember. I *think* it may have just worked out that way originally, since null is returned as the value from SlotAcc.getValue(). And I may have also conflated "empty bucket" with "stat over no values". I'm not sure client parsing is really much harder, since a map interface of bucket.get("mystat") would return null in both cases. On the other hand, I can see how it could be confusing to request a stat and not see it at all in the response. Overall I guess I'm leaning toward returning "mystat":null for a non-empty bucket where mystat has no value / undefined value.

bq. For a singleton set, the stddev() should absolutely be "0"

Standard deviation of a population of size 1, yes. But this issue is about switching to the standard deviation of a sample, and that is undefined (or infinite) for a single sample.

Python throws an exception: https://docs.python.org/3/library/statistics.html#statistics.stdev
Google Sheets returns a div-by-0 error: https://support.google.com/docs/answer/3094054?hl=en
Excel also gives a div-by-0 error with a single value.

I can't find anything using the "N-1" variant that returns 0 for a single sample.

> json.facet's stddev() function should be changed to use the "Corrected sample
> stddev" formula
> -----------------------------------------------------------------------------
>
>                 Key: SOLR-11725
>                 URL: https://issues.apache.org/jira/browse/SOLR-11725
>             Project: Solr
>          Issue Type: Improvement
>   Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Hoss Man
>     Attachments: SOLR-11725.patch
>
> While working on some equivalence tests/demonstrations for
> {{facet.pivot+stats.field}} vs {{json.facet}} I noticed that the {{stddev}}
> calculations done between the two code paths can be measurably different, and
> realized this is due to them using very different code...
> * {{json.facet=foo:stddev(foo)}}
> ** {{StddevAgg.java}}
> ** {{Math.sqrt((sumSq/count)-Math.pow(sum/count, 2))}}
> * {{stats.field=\{!stddev=true\}foo}}
> ** {{StatsValuesFactory.java}}
> ** {{Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 1.0D)))}}
> Since I'm not really a math guy, I consulted with a bunch of smart math/stat
> nerds I know online to help me sanity check whether these equations (somehow)
> reduced to each other (in which case the discrepancies I was seeing in my
> results might have just been due to the order of intermediate operation
> execution & floating point rounding differences).
> They confirmed that the two bits of code are _not_ equivalent to each other,
> and explained that the code JSON Faceting is using is equivalent to the
> "Uncorrected sample stddev" formula, while StatsComponent's code is
> equivalent to the "Corrected sample stddev" formula...
> https://en.wikipedia.org/wiki/Standard_deviation#Uncorrected_sample_standard_deviation
> When I told them that stuff like this is why no one likes mathematicians and
> pressed them to explain which one was the "most canonical" (or "most
> generally applicable" or "best") definition of stddev, I was told that:
> # This is something statisticians frequently disagree on
> # Practically speaking, the difference between the calculations doesn't tend
> to be significant when count is "very large"
> # _"Corrected sample stddev" is more appropriate when comparing two
> distributions_
> Given that:
> * the primary usage of computing the stddev of a field/function against a
> Solr result set (or against a sub-set of results defined by a facet
> constraint) is probably to compare that distribution to a different Solr
> result set (or to compare N sub-sets of results defined by N facet
> constraints)
> * the size of the sets of documents (values) can be relatively small when
> computing stats over facet constraint sub-sets
> ...it seems like {{StddevAgg.java}} should be updated to use the "Corrected
> sample stddev" equation.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
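The two formulas quoted in the issue description can be checked directly. The following is a minimal standalone sketch (not Solr code; class and method names are illustrative) that computes both variants from the same running sums and shows they disagree for a small sample:

```java
// Minimal sketch (not Solr code): the two stddev formulas from the issue
// description, each computed from the same running sums (count, sum, sumSq).
public class StddevCompare {
    // "Uncorrected sample stddev" -- the formula StddevAgg.java uses
    static double uncorrected(long count, double sum, double sumSq) {
        return Math.sqrt((sumSq / count) - Math.pow(sum / count, 2));
    }

    // "Corrected sample stddev" (N-1) -- the formula StatsValuesFactory.java uses
    static double corrected(long count, double sum, double sumSq) {
        return Math.sqrt(((count * sumSq) - (sum * sum)) / (count * (count - 1.0D)));
    }

    public static void main(String[] args) {
        double[] vals = {10.0, 20.0, 30.0};
        long count = vals.length;
        double sum = 0, sumSq = 0;
        for (double v : vals) { sum += v; sumSq += v * v; }
        // For N=3 the results differ noticeably:
        System.out.println(uncorrected(count, sum, sumSq)); // sqrt(200/3) ~ 8.165
        System.out.println(corrected(count, sum, sumSq));   // sqrt(200/2) = 10.0
    }
}
```

As the description notes, the gap shrinks as count grows, but for the small per-bucket counts typical of facet sub-sets it is easily visible.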
[jira] [Commented] (SOLR-11725) json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula
[ https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288090#comment-16288090 ]

ASF subversion and git services commented on SOLR-11725:
--------------------------------------------------------

Commit 2990c88a927213177483b61fe8e6971df04fc3ed in lucene-solr's branch refs/heads/master from Chris Hostetter
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2990c88 ]

Beef up testing of json.facet 'refine:simple' when dealing with 'Long Tail' terms

In an attempt to get more familiar with json.facet refinement, I set out to try and refactor/generalize/clone some of the existing facet.pivot refinement tests to assert that json.facet could produce the same results. This test is a baby step towards doing that: cloning DistributedFacetPivotLongTailTest into DistributedFacetSimpleRefinementLongTailTest (with shared index-building code).

Along the way, I learned that the core logic of 'refine:simple' is actually quite different from how facet.field & facet.pivot work (see discussion in SOLR-11733), so they do *NOT* produce the same results in many "Long Tail" situations. As a result, many of the logic/assertions in DistributedFacetSimpleRefinementLongTailTest are very different from their counterparts in DistributedFacetPivotLongTailTest, with detailed explanations in comments.

Hopefully this test will prove useful down the road to anyone who might want to compare/contrast facet.pivot with json.facet, and to prevent regressions in 'refine:simple' if/when we add more complex refinement approaches in the future.

There are also a few TODOs in the test related to some other small discrepancies between json.facet and stats.field that I opened along the way, indicating where the tests should be modified once those issues are addressed in json.facet...
- SOLR-11706: support for multivalued numeric fields in stats
- SOLR-11695: support for 'missing()' & 'num_vals()' (aka: 'count' from stats.field) numeric stats
- SOLR-11725: switch from 'uncorrected stddev' to 'corrected stddev'
[jira] [Commented] (SOLR-11725) json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula
[ https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288086#comment-16288086 ]

ASF subversion and git services commented on SOLR-11725:
--------------------------------------------------------

Commit 53f2d4aa3aa171d5f37284eba9ca56d987729796 in lucene-solr's branch refs/heads/branch_7x from Chris Hostetter
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=53f2d4a ]

Beef up testing of json.facet 'refine:simple' when dealing with 'Long Tail' terms
(cherry picked from commit 2990c88a927213177483b61fe8e6971df04fc3ed)
[jira] [Commented] (SOLR-11725) json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula
[ https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282354#comment-16282354 ]

Hoss Man commented on SOLR-11725:
---------------------------------

bq. This does bring up the question of what to do when N=1 (or N=0 for that matter).

I omitted them from my original description for brevity, to focus on the bigger picture of the equations, but for the record the full implementation of stddev in each of the two classes mentioned is...

* {{StddevAgg.java}}:
{code}
double val = count == 0 ? 0.0d : Math.sqrt((sumSq/count)-Math.pow(sum/count, 2));
return val;
{code}
* {{StatsValuesFactory.java}}:
{code}
if (count <= 1.0D) {
  return 0.0D;
}
return Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 1.0D)));
{code}

bq. When N=0, the current code produces 0, but I don't think that's the best choice. ...

Agreed, it should really be 'null' (or 'NaN'). (I'm not sure why {{StatsValuesFactory.java}} currently returns {{0.0D}} when {{count==0}} ... other {{StatsValuesFactory.java}} stats like min/max correctly return 'null' ... it's weird.)

bq. ...In general we've been moving toward omitting undefined functions. Stats like min() and max() already do this.

Whoa... really? ... that seems like it would make the client parsing really hard... You're saying users can't expect that every "facet" key they specify in the request will be included in the response? (in the event it's 'null' or 'NaN' or whatever makes sense given its data type) Why???

bq. I'd be tempted to treat N=0 and N=1 as undefined

As I said, for N=0 I agree with you that the result should be "undefined/null/NaN" (and if that means it's excluded from the response to be consistent with the existing behavior in {{json.facet}}, then so be it) ... but I'm a big "-1" (vote, I mean, not math) on treating stddev(N=1) as "undefined" ... that makes no sense to me.
For a singleton set, the stddev() should *absolutely* be "0" -- all of the value(s) in the set are identical, so the amount of deviation between the value(s) in the set is "none". For the purpose of comparing the "consistency" of this set to any other sets, you know that this set is as consistent as it can possibly be. Why should {{stddev(\[42])}} be any different from {{stddev(\[42,42,42,42,42])}}?

bq. Oh, and whatever treatment we give stddev(), we should presumably give to variance()?

I would assume so, but first I'd have to go refresh my memory on how exactly variance differs from stddev :)
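The N=0 and N=1 edge cases under debate here can be demonstrated in isolation. This is a sketch (not Solr's implementation) of what the corrected (N-1) formula does *without* the guard clauses shown in the code excerpts above: both edge cases hit 0/0 and yield NaN, which is why a guard has to pick some convention (0, null, or omission):

```java
// Sketch (not Solr's implementation): the corrected (N-1) formula applied
// at the edge cases, with no guard clause.
public class StddevEdgeCases {
    static double correctedNoGuard(long count, double sum, double sumSq) {
        return Math.sqrt(((count * sumSq) - (sum * sum)) / (count * (count - 1.0D)));
    }

    public static void main(String[] args) {
        // N=1, single value 42: numerator and denominator are both 0 -> 0/0 -> NaN
        System.out.println(correctedNoGuard(1, 42.0, 42.0 * 42.0)); // NaN
        // N=0: again 0/0 -> NaN
        System.out.println(correctedNoGuard(0, 0.0, 0.0)); // NaN
    }
}
```

The uncorrected (N) formula, by contrast, yields 0.0 at N=1 (sqrt of 42² /1 minus 42²), which matches the "singleton set has zero deviation" intuition argued for above; the disagreement in the thread is precisely which convention the guard should encode once the N-1 formula is adopted.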
[jira] [Commented] (SOLR-11725) json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula
[ https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281837#comment-16281837 ]

Yonik Seeley commented on SOLR-11725:
-------------------------------------

+1 for changing... "N-1" is the more standard form.

bq. Attaching a trivial patch containing the change Hoss spelled out above.

Note that the accumulator needs to be changed as well for the non-distributed stddev calculation. The Merger is not used in that case.

This does bring up the question of what to do when N=1 (or N=0 for that matter). Standard deviation of a population of N=1 is 0, but of a sample of N=1 is undefined (or infinity?). When N=0, the current code produces 0, but I don't think that's the best choice. In general we've been moving toward omitting undefined functions. Stats like min() and max() already do this. TestJsonFacets has this:

{code}
// stats at top level, matching documents, but no values in the field
// NOTE: this represents the current state of what is returned, not the ultimate desired state.
client.testJQ(params(p, "q", "id:3"
    , "json.facet", "{ sum1:'sum(${num_d})', sumsq1:'sumsq(${num_d})', avg1:'avg(${num_d})', min1:'min(${num_d})', max1:'max(${num_d})'" +
        ", numwhere:'unique(${where_s})', unique_num_i:'unique(${num_i})', unique_num_d:'unique(${num_d})', unique_date:'unique(${date})'" +
        ", where_hll:'hll(${where_s})', hll_num_i:'hll(${num_i})', hll_num_d:'hll(${num_d})', hll_date:'hll(${date})'" +
        ", med:'percentile(${num_d},50)', perc:'percentile(${num_d},0,50.0,100)', variance:'variance(${num_d})', stddev:'stddev(${num_d})' }"
    )
    , "facets=={count:1 " +
        ",sum1:0.0," +
        " sumsq1:0.0," +
        " avg1:0.0," +
        // TODO: undesirable. omit?
        // " min1:'NaN'," +
        // " max1:'NaN'," +
        " numwhere:0," +
        " unique_num_i:0," +
        " unique_num_d:0," +
        " unique_date:0," +
        " where_hll:0," +
        " hll_num_i:0," +
        " hll_num_d:0," +
        " hll_date:0," +
        " variance:0.0," +
        " stddev:0.0" +
        " }"
);
{code}

I'd be tempted to treat N=0 and N=1 as undefined, and omit them.
Note that we need to be careful to have the N=1 case still contribute to a distributed bucket, even if it's undefined locally! In the distributed case, N=0 is normally handled generically for anything that doesn't produce a result (they are "missing"/null and are sorted after anything that has a value). Things may work if we make getDouble() return 0 (for sorting), but then override getMergedResult() to return null when N <= 1.

Oh, and whatever treatment we give stddev(), we should presumably give to variance()?
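The distributed behavior described above can be sketched as follows. This uses hypothetical names (it is not Solr's actual SlotAcc/Merger API): each shard contributes its partial (count, sum, sumSq) sums, the merger adds them, and the merged result is null when the combined N <= 1, so a shard with a single value still contributes even though its local stddev is undefined:

```java
// Sketch with hypothetical names (not Solr's SlotAcc/Merger API): merge
// per-shard (count, sum, sumSq) partials, then apply the corrected (N-1)
// formula, returning null when the merged count is <= 1.
public class StddevMergeSketch {
    long count;
    double sum, sumSq;

    // Merge one shard's partial sums into this bucket.
    void add(long c, double s, double sq) {
        count += c; sum += s; sumSq += sq;
    }

    // Merged result: null (undefined) for a sample of size 0 or 1.
    Double mergedStddev() {
        if (count <= 1) return null;
        return Math.sqrt(((count * sumSq) - (sum * sum)) / (count * (count - 1.0D)));
    }

    public static void main(String[] args) {
        StddevMergeSketch m = new StddevMergeSketch();
        m.add(1, 42.0, 42.0 * 42.0);  // shard A: single value 42 (undefined alone)
        m.add(2, 30.0, 500.0);        // shard B: values {10, 20}
        // Merged N=3 -> defined, even though shard A's local stddev was not.
        System.out.println(m.mergedStddev());
    }
}
```

Because the partials are plain sums, merging is associative and order-independent, which is what makes the "undefined locally, defined globally" case work.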