Hi Satra.
Thanks for your comments.
Can you explain what the "grab an engine" strategy means?
Does it mean that the jobs are distributed to the engines before any of them start, rather than being kept in a queue?
That should be OK if my jobs and my engines are fairly homogeneous, right?
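
To make the distinction I am asking about concrete, here is a minimal sketch in plain Python. It is only a toy model of the two scheduling styles, not how IPython or SGE actually assign work; the function names and the job durations are made up for illustration.

```python
import heapq


def static_makespan(durations, n_engines):
    """Assign jobs round-robin up front; return when the slowest engine finishes."""
    loads = [0.0] * n_engines
    for i, d in enumerate(durations):
        loads[i % n_engines] += d
    return max(loads)


def queued_makespan(durations, n_engines):
    """Keep jobs in a queue; each job goes to whichever engine becomes idle first."""
    free_at = [0.0] * n_engines  # min-heap of the times at which engines become idle
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + d)
    return max(free_at)


# Hypothetical workloads: eight equal jobs, and eight jobs with one long outlier.
homogeneous = [1.0] * 8
skewed = [8.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

for name, jobs in [("homogeneous", homogeneous), ("skewed", skewed)]:
    print(name, static_makespan(jobs, 4), queued_makespan(jobs, 4))
```

With equal jobs both strategies finish at the same time; only the skewed workload makes the up-front assignment slower, which is why I would expect homogeneous jobs to be fine.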

The main question for me is whether there is an easy, non-intrusive way to let my interface interact with sklearn.

As far as I can tell, the work that is planned for the sprint is much more "low level" than what joblib is doing at the moment.
There are two use cases for me: model selection and ensemble learning.
For both of them, parallelisation should be fairly easy.
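
As a rough illustration of why model selection fits a "map" interface, here is a minimal sketch using only the standard library. `evaluate` is a hypothetical stand-in for fitting and scoring one configuration, and the thread pool stands in for a cluster view; I am assuming that any object whose `map` yields results in input order could be swapped in.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def evaluate(params):
    """Hypothetical stand-in for fitting and scoring one model configuration."""
    c, gamma = params
    # Toy score that peaks at c=1.0, gamma=0.1; a real job would cross-validate here.
    return -((c - 1.0) ** 2 + (gamma - 0.1) ** 2)


def select_best(map_fn, grid):
    """Score every configuration through any map-like callable; return the best."""
    grid = list(grid)
    scores = list(map_fn(evaluate, grid))
    best = max(range(len(grid)), key=scores.__getitem__)
    return grid[best], scores[best]


# Made-up parameter grid: every combination of three C values and three gammas.
grid = list(product([0.1, 1.0, 10.0], [0.01, 0.1, 1.0]))

# Serial baseline using the builtin map.
serial_best = select_best(map, grid)

# Same code path with a parallel map.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_best = select_best(pool.map, grid)

print(serial_best[0], parallel_best[0])
```

Ensemble learning has the same shape when the members are trained independently: map a "fit one member" function over seeds or bootstrap samples and collect the results.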

I do have a shared file system, and some engines might share memory, but I don't really want to go there. The runtime of my jobs is much longer than the time it would take to transfer the data over the network, so I think I'd be happy with just using the IPython library.

Unfortunately, I can't come to PyCon, and I need a working solution before then ;) Well, I actually have a working solution for forests, but it's not good enough ^^

Cheers,
Andy



On 01/27/2012 04:00 PM, Satrajit Ghosh wrote:
hi andreas,

a few notes:

- a sprint planned for pycon will be looking at parallel computing with scikit-learn and ipython (http://wiki.ipython.org/PyCon12Sprint)

- ipython currently uses a "grab an engine and don't release it" strategy in the context of distributed systems like SGE/PBS/LSF. this implies that the load distribution happens at engine instantiation time, not at execution time. depending on your cluster this may be a positive or a negative thing.

- in nipype we do distributed computing by offering the ability to use ipython as the point of distribution or directly interfacing with the cluster engine. here is the ipython plugin:

https://github.com/nipy/nipype/blob/master/nipype/pipeline/plugins/ipythonxi.py

- there is also a python library called soma-workflow that offers a python interface to distributed computing using drmaa.

which route to take will depend mainly on how the data gets to the compute node (by files, pickling, or shared memory) and on whether the file system is shared or the data is moved between processes.

cheers,

satra


On Fri, Jan 27, 2012 at 9:44 AM, Andreas wrote:

    Hi everybody.
    This question basically goes out to Gael, but might also be interesting for others.
    I am using sklearn on an SGE cluster at the moment and it is not as nice as it could be.
    So I was wondering whether there would be a non-intrusive way to make sklearn parallelize over the cluster.
    At the moment all parallelism is handled by joblib. On the other hand, it seems IPython can talk to the SGE scheduler.
    So I would love to have a way for joblib to talk to IPython.

    Is there an easy way to make this possible?
    I was thinking about monkey-patching the Parallel class to use
    "LoadBalancedView" from IPython.
    Do you think this is feasible?

    Another question is whether there are additional assumptions made by sklearn about the way the parallelism works.
    IPython basically provides a "map" interface similar to "Parallel", so I would hope that there are no problems. Do you think there will be?

    Any help would be welcome.

    If I actually get this to work, I feel this might be quite a success
    story for sklearn ;)

    Cheers,
    Andy

    
    _______________________________________________
    Scikit-learn-general mailing list
    https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


