2011/12/7 Timmy Wilson <tim...@smarttypes.org>:
> I would love to sit in, and learn, and contribute where i can.
>
> Probably won't have time for this during the sprint -- but i want to
> throw it out there:
>
> The importance of locality in many manifold learning algos makes them good
> candidates for distribution.

This is interesting, but AFAIK there is no established way to achieve this:
it is still an open research problem.

Personally, I don't plan to work on the parallelization of machine learning
algorithms itself during this sprint, but will focus more on the
infrastructure. That said, to make informed decisions on the infra, it's good
to have some motivating and representative use cases in mind that can be used
to validate proof-of-concept implementations.

For instance, in the machine learning domain (scikit-learn) we could have:

- sparse coding with a fixed dictionary (embarrassingly parallel)

- distributed fitting of a linear model with SGD & averaging (can be
  implemented efficiently with message passing, I think).

In the general data-analytics domain (Pandas & statsmodels):

- distributed (& streaming) computation of means, variances and other
  moments.

- distributed implementation of the alignment of 2 datasets (2d tables)
  using a common key: e.g. the timecode in a time series.

- distributed implementation of the GroupBy feature of Pandas

Also, speaking of scaling machine learning algorithms, the following blog
post titled "Hadoop AllReduce and Terascale Learning" by John Langford is
very interesting:

  http://hunch.net/?p=2094

Maybe we should open a wiki page for sprint planning. Fernando, shall we use
the IPython wiki on github (if so, please enable it)? Otherwise we can use the
scikit-learn wiki that we regularly use for sprint planning, e.g.:

  https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
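
To make the "distributed (& streaming) computation of means, variances" use case a bit more concrete, here is a minimal sketch of what such a proof of concept might look like. It is only an illustration under stated assumptions: the names `Moments`, `summarize`, `merge` and `distributed_moments` are hypothetical, plain Python lists stand in for the data held by each worker, and no actual message passing or IPython engine is involved. Each chunk is summarized in one streaming pass (Welford's update) and the partial summaries are then combined with the standard pairwise merge formula usually attributed to Chan et al.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Moments:
    """Summary of one chunk: count, mean, sum of squared deviations (hypothetical helper)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    @property
    def variance(self):
        # Population variance; NaN for an empty summary.
        return self.m2 / self.n if self.n else float("nan")


def summarize(chunk):
    """Single streaming pass over one chunk (Welford's update)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in chunk:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return Moments(n, mean, m2)


def merge(a, b):
    """Combine two partial summaries without revisiting the raw data."""
    n = a.n + b.n
    if n == 0:
        return Moments()
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Moments(n, mean, m2)


def distributed_moments(chunks):
    """Map each chunk to a summary, then reduce; the chunks stand in for workers."""
    return reduce(merge, map(summarize, chunks), Moments())


if __name__ == "__main__":
    result = distributed_moments([[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]])
    print(result.n, result.mean, result.variance)
```

Because `merge` is associative, the reduce step could in principle be done pairwise across nodes in any order, which is what makes this kind of use case a convenient test for the infrastructure.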
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general