Hi Anders,
>> The problem as I see it is the "tearing it down" bit: I don't want the
>> jobs shutting down before the user has had a chance to get the resulting
>> data, but I suspect if we let users shut them down themselves a lot of them
>> will sit around for no reason.
With Amazon EMR you read data from and write data to S3. By default an EMR
job tears down the cluster on completion (you can keep it running if you
prefer), but the result data stays in S3 and can be picked up later with an
FTP-like command (s3cmd get). Since you are asking on the sklearn mailing
list, you are probably not using EMR, but you can modify your job script to
do something similar, i.e. write the result back to S3 before the script
completes and triggers the cluster shutdown.
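
A rough sketch of what that could look like in a Python job script. This is
only an illustration under assumptions: it assumes boto3 is installed and AWS
credentials are configured, and the bucket name, key, and the helper names
(upload_with_boto3, run_job_then_upload) are made up for the example.

```python
import json
import os
import tempfile


def upload_with_boto3(local_path, bucket, key):
    """Upload one file to S3 (assumes boto3 and configured credentials)."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    boto3.client("s3").upload_file(local_path, bucket, key)


def run_job_then_upload(compute, bucket, key, upload=upload_with_boto3):
    """Run compute(), save its result as JSON, and upload it before returning.

    The shutdown step should only run after this function has returned.
    """
    result = compute()
    fd, local_path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(result, f)
        # Any upload error propagates, so a failed upload fails the job
        # instead of silently losing the result.
        upload(local_path, bucket, key)
    finally:
        os.remove(local_path)  # clean up the temporary local copy
    return result
```

The job script would call something like
run_job_then_upload(train_model, "my-bucket", "results/run1.json") as its last
step (hypothetical names), and the user can fetch the file later with
s3cmd get.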
-sujit
On Mon, Sep 1, 2014 at 4:34 AM, Olivier Grisel <[email protected]>
wrote:
> 2014-09-01 12:39 GMT+02:00 Anders Aagaard <[email protected]>:
> > Data sync is a very good point, and will vary greatly depending on how we
> > set things up. If we do a single major server thing we can probably get
> > people to scp things in, but if we use containers that are started up and
> > killed off on VMs, that's not really a good option.
> >
> > I've used reverse sshfs (mounting a local directory into a directory on
> > the host) with success, but that's a fairly platform-specific solution, and
> > won't really work for a lot of the consumers...
> >
> > Another important point is data safety. We're doing dumps of massive
> > amounts of company data, and I'd prefer that data not be available on any
> > laptops. The Python code can (and should) be available, but it would be
> > nice if the full data dumps were kept as safe as possible (while still, of
> > course, granting developers raw access to them).
>
> This is why I recommend using cloud storage for that, with a
> local working copy synced in the container at startup and shutdown.
>
> Cloud storage services like Amazon S3, Google Storage, Rackspace Cloud
> Files and Azure Blob Store allow for highly concurrent access to
> replicated data, with optional access control policies, encryption and
> automated replication (potentially across datacenters).
>
> Furthermore the connection between cloud compute and cloud storage
> (e.g. Amazon EC2 to S3) can support high throughput via concurrent
> calls.
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>
>
> ------------------------------------------------------------------------------
> Slashdot TV.
> Video for Nerds. Stuff that matters.
> http://tv.slashdot.org/
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>