[
https://issues.apache.org/jira/browse/DERBY-5466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rick Hillegas updated DERBY-5466:
---------------------------------
Attachment: var_pop_formulas.txt
Thanks for the patch, Scott. Your factoring of the hard part into the
population variance class (VarPAggregator, var_pop()) is very elegant.
It took me a while to refresh my memory of what these functions do. Other
people may appreciate the description of these functions found here:
http://www.macroption.com/population-sample-variance-standard-deviation/
Googling around, I found several formulas for calculating population variance.
The easiest one to understand is the Sybase approach:
http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc38151.1520/html/iqrefbb/CHDGJBBI.htm
This is the Sybase formula:
1/n * sum( (xi - m)(xi - m) )
where
m is the mean
n is the number of items in the population
and xi (for i = 1...n) are the items in the population
IBM gives another formula
(http://www-01.ibm.com/support/knowledgecenter/ssw_ibm_i_71/db2/rbafzcolvar.htm):
sum( xi * xi )/n - m*m
It took me a while to convince myself that the formulas are equivalent. For
future reference, I have captured my reasoning in the attached
var_pop_formulas.txt file.
I can't provide any arguments for whether one of these formulas is more
accurate than the other in terms of not losing precision as the values
accumulate. However, the IBM formula does not expose Derby to out-of-memory
errors while building an ArrayList which may come to hold hundreds of millions
of values.
I notice that you are using the Sybase formula. I recommend reworking
VarPAggregator to use the IBM formula, instead.
Thanks,
-Rick
> Add support for SQL Standard statistics functions, such as STDDEV_POP,
> STDDEV_SAMP, VAR_POP, VAR_SAMP
> -----------------------------------------------------------------------------------------------------
>
> Key: DERBY-5466
> URL: https://issues.apache.org/jira/browse/DERBY-5466
> Project: Derby
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 10.8.1.2
> Reporter: Lukas Eder
> Priority: Minor
> Labels: derby_triage10_10
> Attachments: DERBY-5466.stats.patch, var_pop_formulas.txt
>
>
> Any of these RDBMS support the SQL standard statistics functions STDDEV_POP,
> STDDEV_SAMP, VAR_POP, VAR_SAMP:
> - DB2 (only STDDEV, VARIANE)
> - H2
> - HSQLDB
> - Ingres
> - MySQL
> - Oracle
> - Postgres
> - SQL Server (named STDEVP, STDEV, VARP, VAR)
> - Sybase ASE
> - Sybase SQL Anywhere
> These don't:
> - Derby
> - SQLite
> This would be a useful addition for Derby, I think.
> An even larger example list of possible statistics aggregate functions is
> listed in the Postgres documentation:
> http://www.postgresql.org/docs/9.0/static/functions-aggregate.html#FUNCTIONS-AGGREGATE-STATISTICS-TABLE
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)