Mahout has this. We have an LSMR implementation that accepts a generic linear operator. You can implement that operator as an out-of-core multiplication or as a cluster operation.
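To make the operator idea concrete, here is a minimal Java sketch. It is not Mahout's API: `LinearOperator`, `RowStreamOperator`, and `Cgls` are hypothetical names, and the solver is plain CGLS (conjugate gradient on the normal equations) rather than LSMR, picked only because it fits in a few lines. Like LSMR, it touches the matrix only through products with A and A^T, so the same solver would work whether the rows live in memory, in a file, or on a cluster.

```java
/** Hypothetical interface (not Mahout's API): a matrix seen only through its products. */
interface LinearOperator {
    int cols();
    double[] times(double[] x);          // returns A * x
    double[] transposeTimes(double[] y); // returns A^T * y
}

/** Visits one row at a time; the array stands in for a file- or cluster-backed row source. */
final class RowStreamOperator implements LinearOperator {
    private final double[][] rows;
    private final int cols;

    RowStreamOperator(double[][] rows, int cols) {
        this.rows = rows;
        this.cols = cols;
    }

    public int cols() {
        return cols;
    }

    public double[] times(double[] x) {
        double[] y = new double[rows.length];
        for (int i = 0; i < rows.length; i++) {
            for (int j = 0; j < cols; j++) {
                y[i] += rows[i][j] * x[j];
            }
        }
        return y;
    }

    public double[] transposeTimes(double[] y) {
        double[] z = new double[cols];
        for (int i = 0; i < rows.length; i++) {
            for (int j = 0; j < cols; j++) {
                z[j] += rows[i][j] * y[i];
            }
        }
        return z;
    }
}

/** CGLS: minimizes ||A x - b|| using only operator products (an assumption-laden sketch, not LSMR). */
final class Cgls {
    private Cgls() {}

    static double[] solve(LinearOperator a, double[] b, int maxIter, double tol) {
        double[] x = new double[a.cols()];   // start from x = 0
        double[] r = b.clone();              // residual r = b - A x
        double[] s = a.transposeTimes(r);    // normal-equation residual A^T r
        double[] p = s.clone();              // search direction
        double gamma = dot(s, s);
        double stop = tol * tol * gamma;     // stop once ||A^T r|| shrinks by a factor of tol
        for (int k = 0; k < maxIter && gamma > stop; k++) {
            double[] q = a.times(p);
            double alpha = gamma / dot(q, q);
            for (int j = 0; j < x.length; j++) x[j] += alpha * p[j];
            for (int i = 0; i < r.length; i++) r[i] -= alpha * q[i];
            s = a.transposeTimes(r);
            double gammaNew = dot(s, s);
            double beta = gammaNew / gamma;
            for (int j = 0; j < p.length; j++) p[j] = s[j] + beta * p[j];
            gamma = gammaNew;
        }
        return x;
    }

    private static double dot(double[] u, double[] v) {
        double sum = 0.0;
        for (int i = 0; i < u.length; i++) sum += u[i] * v[i];
        return sum;
    }
}
```

For example, wrapping the rows {1, t} for t = 0..3 in a `RowStreamOperator` and calling `Cgls.solve` with b = {1, 3, 5, 7} should return coefficients close to 1 and 2. Each iteration makes one pass over the rows for A and one for A^T, which is the cost that matters when the data does not fit in memory.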
You don't say how large you want the system to be or whether you have sparse data. That might change the answer.

See http://www.stanford.edu/group/SOL/software/lsmr.html

On Fri, Jun 24, 2011 at 11:44 AM, Greg Sterijevski <gsterijev...@gmail.com> wrote:

> Hello All,
>
> I have been a user of the math commons jar for a little over a year and am
> very impressed with it. I was wondering whether anyone is actively working
> on implementing functionality to do regressions on very very large data
> sets. The current implementation of the OLS routine is an in-core QR
> decomposition with substitution. While the solutions are typically accurate,
> the in-core nature limits the usefulness of these objects.
>
> Looking through the code, most of the implementation of an InputStream based
> regression routine would respect the contract implicit in the interface
> MultipleLinearRegression. However, large regression problems are important
> enough that there should be a way to:
>
> 1. Wrap a potentially large data source, perhaps as an InputStream of some
> sort.
> 2. Have a separate contract with methods like clear() (to clear whatever
> intermediate calculations are stored), and regress() which generates
> immutable results that are not affected by further updates of the data.
>
> I would appreciate any thoughts or comments, as well suggestions about
> functionality already in math commons which might address some points I
> raised.
>
> Thank you,
>
> -Greg