[jira] [Commented] (SOLR-11725) json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula

2017-12-28 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16305994#comment-16305994
 ] 

Yonik Seeley commented on SOLR-11725:
-

{quote}
> ...In general we've been moving toward omitting undefined functions. Stats 
> like min() and max() already do this.
Whoa... really? ... that seems like it would make the client parsing really 
hard...
{quote}

Trying to remember.  I *think* it may have just worked out that way originally 
when null was returned as the value from SlotAcc.getValue().
And I may have also conflated "empty bucket" with "stat over no values".  I'm 
not sure if client parsing is really much harder since a map interface of 
bucket.get("mystat") would return null in both cases.
On the other hand, I can see how it could be confusing to request a stat and 
not see it at all in the response.  Overall I guess I'm leaning toward 
returning "mystat":null for a non-empty bucket where mystat has no value / 
undefined value.
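
To illustrate the point about a map interface, here is a minimal sketch using a plain java.util.Map as a stand-in for a parsed bucket. The class name BucketLookup, its two helper methods, and the "count" key are hypothetical and not part of any Solr client API; only the "mystat" key comes from the comment above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a parsed facet bucket; not a Solr client class.
final class BucketLookup {
  // Bucket where the stat was omitted from the response entirely.
  static Map<String, Object> omittedBucket() {
    Map<String, Object> bucket = new HashMap<>();
    bucket.put("count", 3);
    return bucket;
  }

  // Bucket where the stat is present with an explicit null value.
  static Map<String, Object> nullBucket() {
    Map<String, Object> bucket = new HashMap<>();
    bucket.put("count", 3);
    bucket.put("mystat", null);
    return bucket;
  }

  public static void main(String[] args) {
    // get() cannot tell the two cases apart: both print null.
    System.out.println(omittedBucket().get("mystat"));
    System.out.println(nullBucket().get("mystat"));
    // containsKey() can: prints false, then true.
    System.out.println(omittedBucket().containsKey("mystat"));
    System.out.println(nullBucket().containsKey("mystat"));
  }
}
```

So a client that only calls get() sees no difference, while one that checks for the key does.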

bq. For a singleton set, the stddev() should absolutely be "0"

Standard deviation of a population of size 1, yes. But this issue was about 
switching to standard deviation of samples, and that is undefined (or infinite) 
for a single sample.
Python throws an exception: 
https://docs.python.org/3/library/statistics.html#statistics.stdev
Google sheets will return a div-by-0 error: 
https://support.google.com/docs/answer/3094054?hl=en
Excel also gives a div-by-0 error with a single value.  I can't find anything 
using the "N-1" variant that uses 0 for a single sample.
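
As a rough check of why the "N-1" variant is undefined for one sample, the sketch below evaluates the corrected expression quoted in this issue without any guard on count. The class and method names are made up for illustration; this is not the Solr code itself.

```java
// Hypothetical helper, not Solr code: the "N-1" expression quoted in this
// issue, evaluated without a count <= 1 guard.
final class SingleSampleStddev {
  static double correctedStddev(double count, double sum, double sumOfSquares) {
    return Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 1.0D)));
  }

  public static void main(String[] args) {
    // One value, 42: count=1, sum=42, sumOfSquares=1764, so this is sqrt(0.0/0.0).
    System.out.println(correctedStddev(1, 42, 42 * 42)); // NaN
    // No values at all: also a zero divided by a zero.
    System.out.println(correctedStddev(0, 0, 0));        // NaN
  }
}
```

In Java double arithmetic the division quietly yields NaN rather than raising an error, so some explicit choice (0, null, or omitting the stat) has to be made for N <= 1.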



> json.facet's stddev() function should be changed to use the "Corrected sample 
> stddev" formula
> -
>
> Key: SOLR-11725
> URL: https://issues.apache.org/jira/browse/SOLR-11725
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
> Attachments: SOLR-11725.patch
>
>
> While working on some equivalence tests/demonstrations for 
> {{facet.pivot+stats.field}} vs {{json.facet}} I noticed that the {{stddev}} 
> calculations done between the two code paths can be measurably different, and 
> realized this is due to them using very different code...
> * {{json.facet=foo:stddev(foo)}}
> ** {{StddevAgg.java}}
> ** {{Math.sqrt((sumSq/count)-Math.pow(sum/count, 2))}}
> * {{stats.field=\{!stddev=true\}foo}}
> ** {{StatsValuesFactory.java}}
> ** {{Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 1.0D)))}}
> Since I'm not really a math guy, I consulted with a bunch of smart math/stat 
> nerds I know online to help me sanity check if these equations (somehow) 
> reduced to each other (in which case the discrepancies I was seeing in my 
> results might have just been due to the order of intermediate operation 
> execution & floating point rounding differences).
> They confirmed that the two bits of code are _not_ equivalent to each other, 
> and explained that the code JSON Faceting is using is equivalent to the 
> "Uncorrected sample stddev" formula, while StatsComponent's code is 
> equivalent to the "Corrected sample stddev" formula...
> https://en.wikipedia.org/wiki/Standard_deviation#Uncorrected_sample_standard_deviation
> When I told them that stuff like this is why no one likes mathematicians and 
> pressed them to explain which one was the "most canonical" (or "most 
> generally applicable" or "best") definition of stddev, I was told that:
> # This is something statisticians frequently disagree on
> # Practically speaking the two calculations don't tend to differ 
> significantly when count is "very large"
> # _"Corrected sample stddev" is more appropriate when comparing two 
> distributions_
> Given that:
> * the primary usage of computing the stddev of a field/function against a 
> Solr result set (or against a sub-set of results defined by a facet 
> constraint) is probably to compare that distribution to a different Solr 
> result set (or to compare N sub-sets of results defined by N facet 
> constraints)
> * the size of the sets of documents (values) can be relatively small when 
> computing stats over facet constraint sub-sets
> ...it seems like {{StddevAgg.java}} should be updated to use the "Corrected 
> sample stddev" equation.
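
To make the difference between the two quoted expressions concrete, here is a small self-contained sketch that applies both to the same eight values. The class name, method names, and sample data are assumptions chosen for illustration; only the two Math.sqrt expressions are taken from the description above.

```java
// Illustration only: the two expressions quoted in the issue description,
// wrapped in hypothetical helper methods.
final class StddevFormulas {
  // Expression quoted from StddevAgg.java (uncorrected, divides by N).
  static double uncorrected(double count, double sum, double sumSq) {
    return Math.sqrt((sumSq / count) - Math.pow(sum / count, 2));
  }

  // Expression quoted from StatsValuesFactory.java (corrected, divides by N-1).
  static double corrected(double count, double sum, double sumOfSquares) {
    return Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 1.0D)));
  }

  public static void main(String[] args) {
    double[] values = {2, 4, 4, 4, 5, 5, 7, 9}; // made-up sample data
    double sum = 0, sumSq = 0;
    for (double v : values) {
      sum += v;
      sumSq += v * v;
    }
    // Here count=8, sum=40, sumSq=232.
    System.out.println(uncorrected(values.length, sum, sumSq)); // 2.0
    System.out.println(corrected(values.length, sum, sumSq));   // roughly 2.138
  }
}
```

With only eight values the two results already differ by about 7%, which fits the observation that the gap matters most for small facet sub-sets.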



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11725) json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula

2017-12-12 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288090#comment-16288090
 ] 

ASF subversion and git services commented on SOLR-11725:


Commit 2990c88a927213177483b61fe8e6971df04fc3ed in lucene-solr's branch 
refs/heads/master from Chris Hostetter
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2990c88 ]

Beef up testing of json.facet 'refine:simple' when dealing with 'Long Tail' 
terms

In an attempt to get more familiar with json.facet refinement, I set out to try 
and refactor/generalize/clone
some of the existing facet.pivot refinement tests to assert that json.facet 
could produce the same results.
This test is a baby step towards doing that: Cloning 
DistributedFacetPivotLongTailTest into
DistributedFacetSimpleRefinementLongTailTest (with shared index building code).

Along the way, I learned that the core logic of 'refine:simple' is actually 
quite different than how facet.field
& facet.pivot work (see discussion in SOLR-11733), so they do *NOT* produce the 
same results in many "Long Tail"
situations.  As a result, many of the logic/assertions 
in DistributedFacetSimpleRefinementLongTailTest are very
different than their counterparts in DistributedFacetPivotLongTailTest, with 
detailed explanations in comments.

Hopefully this test will prove useful down the road to anyone who might want to 
compare/contrast facet.pivot
with json.facet, and to prevent regressions in 'refine:simple' if/when we add 
more complex refinement
approaches in the future.

There are also a few TODOs in the test related to some other small 
discrepancies between json.facet and
stats.field that I opened along the way, indicating where the tests should be 
modified once those issues are
addressed in json.facet...

 - SOLR-11706: support for multivalued numeric fields in stats
 - SOLR-11695: support for 'missing()' & 'num_vals()' (aka: 'count' from 
stats.field) numeric stats
 - SOLR-11725: switch from 'uncorrected stddev' to 'corrected stddev'



[jira] [Commented] (SOLR-11725) json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula

2017-12-12 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288086#comment-16288086
 ] 

ASF subversion and git services commented on SOLR-11725:


Commit 53f2d4aa3aa171d5f37284eba9ca56d987729796 in lucene-solr's branch 
refs/heads/branch_7x from Chris Hostetter
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=53f2d4a ]

Beef up testing of json.facet 'refine:simple' when dealing with 'Long Tail' 
terms

In an attempt to get more familiar with json.facet refinement, I set out to try 
and refactor/generalize/clone
some of the existing facet.pivot refinement tests to assert that json.facet 
could produce the same results.
This test is a baby step towards doing that: Cloning 
DistributedFacetPivotLongTailTest into
DistributedFacetSimpleRefinementLongTailTest (with shared index building code).

Along the way, I learned that the core logic of 'refine:simple' is actually 
quite different than how facet.field
& facet.pivot work (see discussion in SOLR-11733), so they do *NOT* produce the 
same results in many "Long Tail"
situations.  As a result, many of the logic/assertions 
in DistributedFacetSimpleRefinementLongTailTest are very
different than their counterparts in DistributedFacetPivotLongTailTest, with 
detailed explanations in comments.

Hopefully this test will prove useful down the road to anyone who might want to 
compare/contrast facet.pivot
with json.facet, and to prevent regressions in 'refine:simple' if/when we add 
more complex refinement
approaches in the future.

There are also a few TODOs in the test related to some other small 
discrepancies between json.facet and
stats.field that I opened along the way, indicating where the tests should be 
modified once those issues are
addressed in json.facet...

 - SOLR-11706: support for multivalued numeric fields in stats
 - SOLR-11695: support for 'missing()' & 'num_vals()' (aka: 'count' from 
stats.field) numeric stats
 - SOLR-11725: switch from 'uncorrected stddev' to 'corrected stddev'

(cherry picked from commit 2990c88a927213177483b61fe8e6971df04fc3ed)



[jira] [Commented] (SOLR-11725) json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula

2017-12-07 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282354#comment-16282354
 ] 

Hoss Man commented on SOLR-11725:
-



bq. This does bring up the question of what to do when N=1 (or N=0 for that 
matter).

I omitted them from my original description for brevity to focus on the bigger 
picture of the equations, but for the record the full implementation of stddev in 
each of the two classes mentioned is...

* {{StddevAgg.java}}: {code}
double val = count == 0 ? 0.0d : Math.sqrt((sumSq/count)-Math.pow(sum/count, 2));
return val;
{code}
* {{StatsValuesFactory.java}}: {code}
if (count <= 1.0D) {
  return 0.0D;
}

return Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 1.0D)));
{code}


bq. When N=0, the current code produces 0, but I don't think that's the best 
choice. ...

Agreed, it should really be 'null' (or 'NaN')

(i'm not sure why {{StatsValuesFactory.java}} currently returns {{0.0D}} when 
{{count==0}} ... other {{StatsValuesFactory.java}} stats like min/max correctly 
return 'null' ... it's weird)

bq. ...In general we've been moving toward omitting undefined functions. Stats 
like min() and max() already do this.

Whoa... really? ... that seems like it would make the client parsing really 
hard...

You're saying users can't expect that every "facet" key they specify in the 
request will be included in the response? (in the event it's 'null' or 'NaN' or 
whatever makes sense given its data type)  Why???

bq. I'd be tempted to treat N=0 and N=1 as undefined

As I said, for N=0 I agree with you that the result should be 
"undefined/null/NaN" (and if that means that it's excluded from the response to 
be consistent with the existing behavior in {{json.facet}} then so be it) ... 
but i'm a big "-1" (vote, i mean, not math) on treating stddev(N=1) as 
"undefined" ... that makes no sense to me.  

For a singleton set, the stddev() should *absolutely* be "0" -- all of the 
value(s) in the set are identical, the amount of deviation between the value(s) 
in the set is "none".  For the purpose of comparing the "consistency" of this set 
to any other sets, you know that this set is as consistent as it can possibly 
be.

Why should {{stddev(\[42])}} be any different than 
{{stddev(\[42,42,42,42,42])}}?
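
For reference, a minimal sketch of how the guarded StatsValuesFactory.java snippet quoted above treats those two inputs. The class name and the varargs stddev() helper are hypothetical; only the guard and the Math.sqrt expression come from the quoted code.

```java
// Illustration only; the guard and expression mirror the StatsValuesFactory.java
// snippet quoted above, wrapped in hypothetical helpers.
final class SingletonVsRepeated {
  static double guardedCorrected(double count, double sum, double sumOfSquares) {
    if (count <= 1.0D) {
      return 0.0D;
    }
    return Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 1.0D)));
  }

  // Accumulates count, sum and sum of squares, then applies the guarded formula.
  static double stddev(double... values) {
    double sum = 0, sumOfSquares = 0;
    for (double v : values) {
      sum += v;
      sumOfSquares += v * v;
    }
    return guardedCorrected(values.length, sum, sumOfSquares);
  }

  public static void main(String[] args) {
    System.out.println(stddev(42));                 // 0.0, from the count <= 1 guard
    System.out.println(stddev(42, 42, 42, 42, 42)); // 0.0, from the formula itself
  }
}
```

Both calls return 0.0, but for different reasons: the repeated set gets there through the arithmetic, while the singleton only gets there because of the explicit guard.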

bq. Oh, and whatever treatment we give stddev(), we should presumably give to 
variance()?

I would assume so, but first i'd have to go refresh my memory on how exactly 
variance differs from stddev :)





[jira] [Commented] (SOLR-11725) json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula

2017-12-07 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281837#comment-16281837
 ] 

Yonik Seeley commented on SOLR-11725:
-

+1 for changing... "N-1" is the more standard form.

bq. Attaching a trivial patch containing the change Hoss spelled out above.
Note that the accumulator needs to be changed as well for non-distributed 
stddev calculation.  The Merger is not used in that case.

This does bring up the question of what to do when N=1 (or N=0 for that matter).
Standard deviation of a population of N=1 is 0, but of a sample of N=1 is 
undefined (or infinity?)

When N=0, the current code produces 0, but I don't think that's the best choice.
In general we've been moving toward omitting undefined functions.  Stats like 
min() and max() already do this.

TestJsonFacets has this:
{code}
// stats at top level, matching documents, but no values in the field
// NOTE: this represents the current state of what is returned, not the ultimate desired state.
client.testJQ(params(p, "q", "id:3"
    , "json.facet", "{ sum1:'sum(${num_d})', sumsq1:'sumsq(${num_d})', avg1:'avg(${num_d})', min1:'min(${num_d})', max1:'max(${num_d})'" +
        ", numwhere:'unique(${where_s})', unique_num_i:'unique(${num_i})', unique_num_d:'unique(${num_d})', unique_date:'unique(${date})'" +
        ", where_hll:'hll(${where_s})', hll_num_i:'hll(${num_i})', hll_num_d:'hll(${num_d})', hll_date:'hll(${date})'" +
        ", med:'percentile(${num_d},50)', perc:'percentile(${num_d},0,50.0,100)', variance:'variance(${num_d})', stddev:'stddev(${num_d})' }"
    )
    , "facets=={count:1 " +
        ",sum1:0.0," +
        " sumsq1:0.0," +
        " avg1:0.0," +   // TODO: undesirable. omit?
        // " min1:'NaN'," +
        // " max1:'NaN'," +
        " numwhere:0," +
        " unique_num_i:0," +
        " unique_num_d:0," +
        " unique_date:0," +
        " where_hll:0," +
        " hll_num_i:0," +
        " hll_num_d:0," +
        " hll_date:0," +
        " variance:0.0," +
        " stddev:0.0" +
        " }"
);
{code}

I'd be tempted to treat N=0 and N=1 as undefined, and omit them.  Note that we 
need to be careful to have the N=1 case still contribute to a distributed 
bucket, even if it's undefined locally!
In the distributed case, N=0 is normally handled generically for anything that 
doesn't produce a result (they are "missing"/null and are sorted after anything 
that has a value).  Things may work if we make getDouble() return 0 (for 
sorting), but then override getMergedResult() to return null when N <= 1.
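
A rough sketch of that idea, written as a stand-alone class so that it makes no claims about the real Solr classes. StddevMergeSketch and its merge() method are hypothetical; getDouble() and getMergedResult() only borrow the names mentioned above and do not reproduce the actual signatures.

```java
// Hypothetical stand-alone sketch; it does not extend or call any Solr class.
final class StddevMergeSketch {
  private long count;
  private double sum;
  private double sumSq;

  // Fold in one shard's partial state. A shard with N=1 still contributes,
  // even though its own local stddev would be undefined.
  void merge(long shardCount, double shardSum, double shardSumSq) {
    count += shardCount;
    sum += shardSum;
    sumSq += shardSumSq;
  }

  // Value used for sorting buckets: 0 when the stat is undefined (N <= 1).
  double getDouble() {
    if (count <= 1) {
      return 0.0d;
    }
    return Math.sqrt(((count * sumSq) - (sum * sum)) / (count * (count - 1.0d)));
  }

  // Value reported in the response: null when N <= 1, the "N-1" stddev otherwise.
  Double getMergedResult() {
    return count <= 1 ? null : getDouble();
  }

  public static void main(String[] args) {
    StddevMergeSketch merged = new StddevMergeSketch();
    merged.merge(1, 40, 40 * 40);                 // shard A holds the single value 40
    System.out.println(merged.getMergedResult()); // null: only one value so far
    merged.merge(1, 44, 44 * 44);                 // shard B holds the single value 44
    System.out.println(merged.getMergedResult()); // sqrt(8), about 2.83
  }
}
```

Two shards that are each undefined locally still combine into a well-defined result, which is the case the note about N=1 contributing to a distributed bucket is concerned with.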

Oh, and whatever treatment we give stddev(), we should presumably give to 
variance()?


