Re: Commercializing Mahout: the Myrrix recommender platform

2012-04-23 Thread Hector Yee
Yes, sorry, I meant the Enron email corpus.
On Apr 21, 2012 4:32 PM, "Lance Norskog" wrote:

> Hector, perhaps you meant my former employer Enron?
>
> On Sat, Apr 21, 2012 at 4:20 AM, Grant Ingersoll wrote:
> >
> > On Apr 20, 2012, at 12:05 PM, Hector Yee wrote:
> >
> >> On a related note, wish i could share the data i have to see how these
> >> algorithms stack up to the ones we use for large scale learning.
> >
> > That certainly would be interesting.
> >
> >>
> >> Are there other examples of large data sets people use? I know there's the
> >> Exxon
> >
> > Pointer?  Haven't seen that one.
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


[jira] [Commented] (MAHOUT-1000) Implementation of Single Sample T-Test using Map Reduce/Mahout

2012-04-23 Thread Dev Lakhani (JIRA)

[ https://issues.apache.org/jira/browse/MAHOUT-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259847#comment-13259847 ]

Dev Lakhani commented on MAHOUT-1000:
-------------------------------------

I guess this was a naive first attempt at creating an MR version of the Apache 
Commons Math statistics package. Following this implementation, the idea is to 
go on to ANOVAs, Wilcoxon tests, Pearson correlations, 
Kolmogorov-Smirnov tests and other R-like features (but in MR).

Yes, it could be done in Pig, but it would likely need a UDF: for example, the 
TTest in Commons Math relies on the TDistribution to look up statistical 
values, so it is perhaps better to do the whole thing in Java. This also makes 
it easier to test and control/tune the MR jobs.

I was really just testing the waters to see if there is support for 
this; if so, there are plenty of basic stats tests that can be implemented 
for big data. This will require a bit of help from the community. If not, please 
feel free to close this entry.

Cheers



> Implementation of Single Sample T-Test using Map Reduce/Mahout
> --------------------------------------------------------------
>
> Key: MAHOUT-1000
> URL: https://issues.apache.org/jira/browse/MAHOUT-1000
> Project: Mahout
> Issue Type: New Feature
> Components: Math
> Affects Versions: Backlog
> Environment: Linux, Mac OS, Hadoop 0.20.2, Mahout 0.x
> Reporter: Dev Lakhani
> Labels: newbie
> Fix For: Backlog
>
> Original Estimate: 672h
> Remaining Estimate: 672h
>
> Implement a map/reduce version of the single sample t test to test whether a 
> sample of n subjects comes from a population in which the mean equals a 
> particular value.
> For a large dataset, say n millions of rows, one can test whether the sample 
> (large as it is) comes from a population with the specified mean.
> Input:
> 1) specified population mean to be tested against
> 2) hypothesis direction : i.e. "two.sided", "less", "greater".
> 3) confidence level or alpha
> 4) flag to indicate paired or not paired
> The procedure is as follows:
> 1. Use Map/Reduce to calculate the mean of the sample.
> 2. Use Map/Reduce to calculate the standard error of the sample mean.
> 3. Use Map/Reduce to calculate the t statistic
> 4. Estimate the degrees of freedom depending on equal sample variances 
> Output:
> 1) The value of the t-statistic.
> 2) The p-value for the test.
> 3) Flag that is true if the null hypothesis can be rejected with confidence 1 
> - alpha; false otherwise.
> References:
> http://www.basic.nwu.edu/statguidefiles/ttest_unpaired_ass_viol.html
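
For illustration only, here is a minimal sketch of how steps 1 to 3 of the
procedure above might look as a map/reduce computation. It is plain Python,
not Hadoop or Mahout code, and the function names (map_partition, merge,
one_sample_t) are hypothetical. Each "mapper" summarizes its partition as
(count, mean, sum of squared deviations) and the "reducer" merges those
summaries pairwise, so a single pass over the data is enough. The degrees of
freedom are taken as n - 1, which is an assumption for the single-sample case.
The p-value is left out because it needs a Student t distribution lookup, such
as the TDistribution from Commons Math mentioned in the comment.

```python
import math
from functools import reduce


def map_partition(values):
    """Mapper: summarize one partition as (count, mean, M2).

    M2 is the sum of squared deviations from the partition mean,
    accumulated with Welford's single-pass update.
    """
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2


def merge(a, b):
    """Reducer: combine two partial summaries into one."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


def one_sample_t(partitions, mu0):
    """Return (t statistic, degrees of freedom) for H0: mean == mu0."""
    n, mean, m2 = reduce(merge, map(map_partition, partitions), (0, 0.0, 0.0))
    if n < 2:
        raise ValueError("need at least two observations")
    variance = m2 / (n - 1)            # unbiased sample variance
    std_err = math.sqrt(variance / n)  # standard error of the sample mean
    if std_err == 0.0:
        raise ValueError("sample has zero variance; t is undefined")
    return (mean - mu0) / std_err, n - 1


# Two partitions standing in for two input splits.
t_stat, dof = one_sample_t([[1.0, 2.0, 3.0], [4.0, 5.0]], mu0=2.0)
```

The merge step is associative, so it could in principle also serve as a
combiner. The hypothesis direction, alpha and the paired flag from the input
list are not handled in this sketch.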

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira