Hi! I'm currently working on a rather large-scale dataset (~300M samples represented as dense vectors of cardinality ~100). The data lives in an EC2 Hadoop cluster and is pre-processed using MR jobs, with heavy usage of Mahout (Lanczos decomposition, clustering, etc.).
I'm now looking for ways to learn a logistic regression model on this data. So far I've postponed this part of the project, hoping for MAHOUT-228 <https://issues.apache.org/jira/browse/MAHOUT-228> to be ready... but unfortunately I can't afford to wait any longer :) Looking around, I've found Google's sofia-ml <http://code.google.com/p/sofia-ml/> and a UC Berkeley Hadoop-based implementation <http://berkeley-mltea.pbworks.com/Hadoop-for-Machine-Learning-Guide>. Does anyone have experience with these, or know of (or has used) a good library for logistic regression at this scale?
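To be concrete about what I'm after: essentially the standard per-sample SGD update for L2-regularized logistic regression, applied over the full dataset. A minimal Java sketch follows; the class and variable names are just for illustration and aren't taken from Mahout, sofia-ml, or the Berkeley code:

    // Minimal sketch of sequential SGD for L2-regularized logistic
    // regression. Hypothetical names, for illustration only.
    public class LogRegSGD {
        private final double[] w;     // weight vector, one entry per feature
        private final double lambda;  // L2 regularization strength
        private final double eta;     // learning rate

        public LogRegSGD(int dim, double lambda, double eta) {
            this.w = new double[dim];
            this.lambda = lambda;
            this.eta = eta;
        }

        // One SGD step on a single sample: x is the dense feature
        // vector, y is the label in {0, 1}.
        public void update(double[] x, int y) {
            double z = 0.0;
            for (int i = 0; i < w.length; i++) {
                z += w[i] * x[i];
            }
            double p = 1.0 / (1.0 + Math.exp(-z)); // predicted P(y=1|x)
            double g = p - y;                      // gradient of log-loss w.r.t. z
            for (int i = 0; i < w.length; i++) {
                w[i] -= eta * (g * x[i] + lambda * w[i]);
            }
        }
    }

With only ~100 dense features each update is a few hundred floating-point ops, so a pass over 300M samples should be mostly I/O-bound; my question is really which library handles the data volume and parallelism well.

Thanks,
Danny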