That is a good point. With persistent storage sync, this isn't a major
issue.
Thanks a lot for the input so far, everyone. I'll probably end up spending
some time on this over the next month or so; if I end up with some
interesting scripts, I'll put them on GitHub.
On Mon, Sep 1, 2014 at 4:34 PM, Sujit Pal <[email protected]> wrote:
> Hi Anders,
>
> >> The problem as I see it is the "tearing it down" bit: I don't want the
> jobs shutting down before the user has had a chance to get the resulting
> data, but I suspect that if we let users shut them down themselves, a lot
> of them will sit around for no reason.
> With Amazon EMR you read and write data from and to S3. While the EMR job
> tears down the cluster on completion (by default; you can keep it running
> if you prefer), the result data can be picked up later using an FTP-like
> process (s3cmd get). Since you are asking on the sklearn ML, you are
> probably not using EMR, but you can modify your job script to do something
> similar, i.e. write the results back to S3 before the job completes and
> the cluster shuts down.
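>
> As a rough sketch of that last step (boto and the bucket/key names below
> are assumptions, not part of your setup):
>
>     import boto
>
>     # upload the job's output to S3 before the cluster is torn down
>     conn = boto.connect_s3()  # credentials from the environment/boto config
>     bucket = conn.get_bucket("my-results-bucket")
>     key = bucket.new_key("jobs/2014-09-01/results.csv")
>     key.set_contents_from_filename("results.csv")
>
> The user can then fetch the results whenever convenient, e.g. with
> "s3cmd get s3://my-results-bucket/jobs/2014-09-01/results.csv".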
>
> -sujit
>
>
> On Mon, Sep 1, 2014 at 4:34 AM, Olivier Grisel <[email protected]>
> wrote:
>
>> 2014-09-01 12:39 GMT+02:00 Anders Aagaard <[email protected]>:
>> > Data sync is a very good point, and will vary greatly depending on how
>> > we set things up. If we do a single major server, we can probably get
>> > people to scp things in; if we use containers that are started up and
>> > killed off on VMs, that's not really a good option.
>> >
>> > I've used reverse sshfs (mounting a local directory into a directory on
>> > the host) with success, but that's a fairly platform-specific solution
>> > and won't really work for a lot of the consumers...
>> >
>> > Another important point is data safety. We're doing dumps of massive
>> > amounts of company data, and I'd prefer that data not be available on
>> > any laptops. The Python code can (and should) be available, but it would
>> > be nice if the full data dumps were kept as safe as possible (while
>> > still, of course, granting developers raw access to them).
>>
>> This is why I recommend using cloud storage for that, with a local
>> working copy synced into the container at startup and shutdown.
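>>
>> As a minimal sketch of that sync step (assuming the AWS CLI is available
>> in the container; the bucket and paths below are made up):
>>
>>     import subprocess
>>
>>     BUCKET = "s3://company-ml-data/workspace"  # hypothetical location
>>     LOCAL = "/data/workspace"
>>
>>     def pull():
>>         # at startup: copy the working set from cloud storage into the container
>>         subprocess.check_call(["aws", "s3", "sync", BUCKET, LOCAL])
>>
>>     def push():
>>         # at shutdown: push results back before the container is destroyed
>>         subprocess.check_call(["aws", "s3", "sync", LOCAL, BUCKET])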
>>
>> Cloud storage services like Amazon S3, Google Storage, Rackspace Cloud
>> Files and Azure Blob Store allow for highly concurrent access to
>> replicated data with optional access control policies, encryption and
>> automated replication (potentially cross-datacenters).
>>
>> Furthermore, the connection between cloud compute and cloud storage
>> (e.g. Amazon EC2 to S3) can support high throughput via concurrent
>> calls.
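>>
>> For example, something along these lines (hypothetical key names, boto
>> assumed) fetches several objects in parallel to make better use of the
>> EC2-to-S3 bandwidth:
>>
>>     import boto
>>     from concurrent.futures import ThreadPoolExecutor
>>
>>     def fetch(name):
>>         # one connection per thread, so the GETs overlap
>>         bucket = boto.connect_s3().get_bucket("company-ml-data")
>>         bucket.get_key(name).get_contents_to_filename(name.split("/")[-1])
>>
>>     with ThreadPoolExecutor(max_workers=8) as pool:
>>         list(pool.map(fetch, ["dumps/part-%02d.csv" % i for i in range(16)]))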
>>
>> --
>> Olivier
>> http://twitter.com/ogrisel - http://github.com/ogrisel
>>
>>
>
>
>
>
>
--
Best regards,
Anders Aagaard