[ 
https://issues.apache.org/jira/browse/DERBY-5466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rick Hillegas updated DERBY-5466:
---------------------------------
    Attachment: var_pop_formulas.txt

Thanks for the patch, Scott. Your factoring of the hard part into the 
population variance class (VarPAggregator, var_pop()) is very elegant.

It took me a while to refresh my memory of what these functions do. Other 
people may appreciate the description of these functions found here: 
http://www.macroption.com/population-sample-variance-standard-deviation/

Googling around, I found several formulas for calculating population variance. 
The easiest one to understand is the Sybase approach: 
http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc38151.1520/html/iqrefbb/CHDGJBBI.htm
 This is the Sybase formula:

  1/n * sum( (xi - m)(xi - m) )

  where

    m is the mean
    n is the number of items in the population
    and xi (for i = 1...n) are the items in the population

IBM gives another formula 
(http://www-01.ibm.com/support/knowledgecenter/ssw_ibm_i_71/db2/rbafzcolvar.htm):

  sum( xi * xi )/n - m*m

It took me a while to convince myself that the formulas are equivalent. For 
future reference, I have captured my reasoning in the attached 
var_pop_formulas.txt file.

I can't provide any arguments for whether one of these formulas is more 
accurate than the other in terms of not losing precision as the values 
accumulate. However, the IBM formula does not expose Derby to out-of-memory 
errors while building an ArrayList which may come to hold hundreds of millions 
of values.

I notice that you are using the Sybase formula. I recommend reworking 
VarPAggregator to use the IBM formula, instead.

Thanks,
-Rick


> Add support for SQL Standard statistics functions, such as STDDEV_POP, 
> STDDEV_SAMP, VAR_POP, VAR_SAMP
> -----------------------------------------------------------------------------------------------------
>
>                 Key: DERBY-5466
>                 URL: https://issues.apache.org/jira/browse/DERBY-5466
>             Project: Derby
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 10.8.1.2
>            Reporter: Lukas Eder
>            Priority: Minor
>              Labels: derby_triage10_10
>         Attachments: DERBY-5466.stats.patch, var_pop_formulas.txt
>
>
> Any of these RDBMS support the SQL standard statistics functions STDDEV_POP, 
> STDDEV_SAMP, VAR_POP, VAR_SAMP:
> - DB2 (only STDDEV, VARIANE)
> - H2 
> - HSQLDB 
> - Ingres 
> - MySQL 
> - Oracle 
> - Postgres 
> - SQL Server (named STDEVP, STDEV, VARP, VAR)
> - Sybase ASE
> - Sybase SQL Anywhere
> These don't:
> - Derby
> - SQLite
> This would be a useful addition for Derby, I think.
> An even larger example list of possible statistics aggregate functions is 
> listed in the Postgres documentation:
> http://www.postgresql.org/docs/9.0/static/functions-aggregate.html#FUNCTIONS-AGGREGATE-STATISTICS-TABLE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to