[ https://issues.apache.org/jira/browse/MAHOUT-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259847#comment-13259847 ]
Dev Lakhani commented on MAHOUT-1000:
-------------------------------------
I guess this was a naive attempt at creating an MR version of the Apache
Commons Math statistics package. Following this implementation, the idea is to
go on to extend it to ANOVAs, Wilcoxon tests, Pearson correlations,
Kolmogorov-Smirnov tests and other R-like features (but in MR).
Yes, it could be done in Pig, but it would likely need a UDF: for example, the
TTest in Commons Math relies on the TDistribution to look up statistical values,
so it is perhaps better to do the whole thing in Java. This also makes it easier
to test and control/tune the MR jobs.
I was really just testing the waters to see if there is support for this; if
so, there are plenty of basic statistical tests that can be implemented for big
data. This will require a bit of help from the community. If not, please feel
free to close this entry.
Cheers
> Implementation of Single Sample T-Test using Map Reduce/Mahout
> --------------------------------------------------------------
>
> Key: MAHOUT-1000
> URL: https://issues.apache.org/jira/browse/MAHOUT-1000
> Project: Mahout
> Issue Type: New Feature
> Components: Math
> Affects Versions: Backlog
> Environment: Linux, Mac OS, Hadoop 0.20.2, Mahout 0.x
> Reporter: Dev Lakhani
> Labels: newbie
> Fix For: Backlog
>
> Original Estimate: 672h
> Remaining Estimate: 672h
>
> Implement a map/reduce version of the single sample t-test to test whether a
> sample of n subjects comes from a population in which the mean equals a
> particular value.
> For a large dataset, say n in the millions of rows, one can test whether the
> sample (large as it is) comes from a population with the specified mean.
> Input:
> 1) specified population mean to be tested against
> 2) hypothesis direction : i.e. "two.sided", "less", "greater".
> 3) confidence level or alpha
> 4) flag to indicate paired or not paired
> The procedure is as follows:
> 1. Use Map/Reduce to calculate the mean of the sample.
> 2. Use Map/Reduce to calculate the standard error of the sample mean.
> 3. Use Map/Reduce to calculate the t-statistic.
> 4. Estimate the degrees of freedom depending on equal sample variances.
> Output:
> 1) The value of the t-statistic.
> 2) The p-value for the test.
> 3) Flag that is true if the null hypothesis can be rejected with confidence
> 1 - alpha; false otherwise.
> References
> http://www.basic.nwu.edu/statguidefiles/ttest_unpaired_ass_viol.html
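
As a rough illustration of steps 1 to 3 of the procedure above, here is a minimal single-JVM sketch in plain Java, not the proposed Mahout implementation. It assumes each mapper reduces its input split to a small mergeable summary (count, sum, sum of squares) and that the reducer adds these summaries before computing the t-statistic. The names OneSampleTTestSketch, Partial, tStatistic and degreesOfFreedom are hypothetical, and the degrees of freedom are taken as n - 1, the usual choice for the one-sample case. The p-value is left out because it needs a t-distribution CDF, for example from Commons Math.

```java
import java.util.Arrays;
import java.util.List;

public class OneSampleTTestSketch {

    /** Mergeable summary of one input split: count, sum and sum of squares. */
    static final class Partial {
        final long n;
        final double sum;
        final double sumSq;

        Partial(long n, double sum, double sumSq) {
            this.n = n;
            this.sum = sum;
            this.sumSq = sumSq;
        }

        /** "Map" side: summarise one split of the data in a single pass. */
        static Partial of(double[] values) {
            double s = 0.0;
            double ss = 0.0;
            for (double v : values) {
                s += v;
                ss += v * v;
            }
            return new Partial(values.length, s, ss);
        }

        /** "Combine"/"reduce" side: summaries add component-wise. */
        Partial merge(Partial other) {
            return new Partial(n + other.n, sum + other.sum, sumSq + other.sumSq);
        }
    }

    /** t = (mean - mu0) / (s / sqrt(n)), where s uses the n - 1 denominator. */
    static double tStatistic(Partial p, double mu0) {
        if (p.n < 2) {
            throw new IllegalArgumentException("need at least two observations");
        }
        double mean = p.sum / p.n;
        double variance = (p.sumSq - p.n * mean * mean) / (p.n - 1);
        double stdErr = Math.sqrt(variance / p.n);
        return (mean - mu0) / stdErr;
    }

    /** Degrees of freedom for the one-sample test (assumed to be n - 1). */
    static long degreesOfFreedom(Partial p) {
        return p.n - 1;
    }

    public static void main(String[] args) {
        // Two made-up "splits" standing in for mapper inputs.
        List<double[]> splits = Arrays.asList(
                new double[] {1.0, 2.0, 3.0},
                new double[] {4.0, 5.0});

        Partial total = new Partial(0, 0.0, 0.0);
        for (double[] split : splits) {
            total = total.merge(Partial.of(split));
        }

        System.out.println("t  = " + tStatistic(total, 2.0));
        System.out.println("df = " + degreesOfFreedom(total));
    }
}
```

Because the summaries simply add, the same merge could serve as both combiner and reducer, so the mean and standard error come out of one pass instead of separate jobs. The sum-of-squares formula can lose precision when the mean is large relative to the spread; a pairwise mean/M2 merge would be a more stable alternative.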