hey olivier,
although i won't be going to pycon, i think this is a great direction.
however, i think ascertaining the scope of what you want to solve/achieve
is important.
for a lot of practical use cases in domains that i'm familiar with,
parallelization at the algorithmic level helps minimally because of the
size of the data and the cost of accessing/moving this data over the network.
in fact for a particular international distributed cluster, multiple
algorithms were being pipelined on a single core (i.e. that was the atomic
unit) to prevent large data movement.
i think algorithmic parallelization and efficiency are important, and it
might just be good to get some feedback on use cases so that the
infrastructure does not constrain one to problems of a specific type or in
specific domains. i think a wiki or an issue thread is a good start.
cheers,
satra
On Wed, Dec 7, 2011 at 9:28 AM, Olivier Grisel <olivier.gri...@ensta.org> wrote:
> 2011/12/7 Timmy Wilson <tim...@smarttypes.org>:
> > I would love to sit in, and learn, and contribute where i can.
> >
> > Probably won't have time for this during the sprint -- but i want to
> > throw it out there:
> >
> > The importance of locality in many manifold learning algos makes them good
> > candidates for distribution.
>
> This is interesting but AFAIK there is no established way to achieve
> this and this is still an open research problem.
>
> Personally I don't plan to work on the parallelization of machine learning
> algorithms themselves during this sprint but to focus more on the
> infrastructure. Although to make informed decisions on the infra it's
> good to have some motivating and representative use cases in mind that
> can be used to validate proof-of-concept implementations.
>
> For instance: in the machine learning domain (scikit-learn) we could have:
> - sparse coding with a fixed dictionary (embarrassingly parallel)
> - distributed fitting of a linear model with SGD & averaging (can be
> implemented efficiently with message passing I think).
>
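To make the "SGD & averaging" use case concrete, here is a minimal sketch. It assumes that "averaging" means one-shot averaging of the weights fitted independently on each data partition, which is only one possible reading of the item above. The names `sgd_fit` and `average_models` are hypothetical, the data is a toy noiseless line, and the "workers" run sequentially in one process; a real setup would ship each partition to a separate engine and only send the small weight tuples back.

```python
import random


def sgd_fit(points, epochs=200, lr=0.1, seed=0):
    """Fit y ~ w * x + b on one partition with plain SGD on the squared loss."""
    rng = random.Random(seed)
    points = list(points)  # local copy so shuffling does not touch the caller's data
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(points)
        for x, y in points:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b


def average_models(models):
    """Combine per-partition (w, b) pairs by a simple unweighted average."""
    n = len(models)
    return (sum(w for w, _ in models) / n, sum(b for _, b in models) / n)


# Toy data (an assumption for illustration): y = 2x + 1, no noise.
data = [(x, 2.0 * x + 1.0) for x in (i / 50.0 - 1.0 for i in range(100))]

# Each "worker" only ever sees its own partition.
partitions = [data[k::4] for k in range(4)]
models = [sgd_fit(part, seed=k) for k, part in enumerate(partitions)]

# Only the fitted weights are combined, never the raw samples.
w_avg, b_avg = average_models(models)
print(w_avg, b_avg)
```

The point of the sketch is the communication pattern: the per-partition fits need no coordination, and the only data that moves is one `(w, b)` pair per worker.
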
> In the general data-analytics domain (Pandas & statsmodels):
> - distributed (& streaming) computation of means, variances and other
> moments.
> - distributed implementation of the alignment of 2 datasets (2d tables)
> using a common key: e.g. the timecode in a time series.
> - distributed implementation of the GroupBy feature of Pandas
>
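The distributed (& streaming) moments case can be sketched with a small mergeable summary: each chunk is reduced to a count, a mean and a sum of squared deviations, and two summaries are combined with the standard pairwise update formula. The `Moments` class and `summarize` helper are hypothetical names for illustration, not part of Pandas or statsmodels, and the chunks are processed sequentially here rather than on separate nodes.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class Moments:
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        """Streaming (Welford-style) update with a single value."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        """Return the summary of the union of two disjoint chunks."""
        if other.n == 0:
            return Moments(self.n, self.mean, self.m2)
        if self.n == 0:
            return Moments(other.n, other.mean, other.m2)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Moments(n, mean, m2)

    @property
    def variance(self):
        """Population variance (divides by n, not n - 1)."""
        return self.m2 / self.n if self.n else float("nan")


def summarize(chunk):
    """Reduce one chunk of values to its Moments summary."""
    m = Moments()
    for x in chunk:
        m.update(x)
    return m


# Each chunk could live on a different node; only the summaries are merged.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
total = reduce(Moments.merge, map(summarize, chunks))
print(total.n, total.mean, total.variance)
```

Because `merge` is associative, the summaries can be combined in any grouping (for example in a tree), and only three numbers per chunk ever cross the network.
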
> Also speaking about scaling machine learning algorithms, the following
> blog post titled "Hadoop AllReduce and Terascale Learning" by John
> Langford is very interesting:
>
> http://hunch.net/?p=2094
>
> Maybe we should open a wiki page for sprint planning. Fernando, shall we
> use the IPython wiki on github (if so please enable it)? Otherwise we
> can use the scikit-learn wiki that we regularly use for sprint
> planning, e.g.:
> https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>
>
> ------------------------------------------------------------------------------
> Cloud Services Checklist: Pricing and Packaging Optimization
> This white paper is intended to serve as a reference, checklist and point
> of
> discussion for anyone considering optimizing the pricing and packaging
> model
> of a cloud services business. Read Now!
> http://www.accelacomm.com/jaw/sfnl/114/51491232/
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>