There was a guest post on the Jupyter blog the other day about Quilt, which may be interesting for you to look at: https://blog.jupyter.org/reproducible-data-dependencies-for-python-guest-post-d0f68293a99
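For anyone who hasn't tried it, the workflow that post describes looks roughly
like this (a sketch from memory of the quilt 2.x docs; the package name and
node layout follow their iris example and may differ from what the post uses):

    # shell, once:
    #   pip install quilt
    #   quilt install uciml/iris   # fetch a versioned data package

    from quilt.data.uciml import iris  # data packages import like code

    df = iris.tables.iris()  # materializes the data as a pandas DataFrame
    print(df.head())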
Jason

On Thu, Mar 15, 2018 at 12:34 PM Matthew Turk <[email protected]> wrote:
> Hi!
>
> Great question. My name's Matt Turk, and along with some other folks
> (lurking?) on this list I work on a project called Whole Tale. We just
> had an overview paper published (gold OA) at
> https://doi.org/10.1016/j.future.2017.12.029 that gives some
> architectural information, but the gist is that we're trying to solve
> exactly that problem. Our website isn't the best, and we don't expect a
> stable, running instance until early summer (I bet if you logged in you
> could find ways to break it, or prickly bits in the UI), but you can
> find a bit more at wholetale.org and github.com/whole-tale . You could
> even launch your own instance, should you want to.
>
> The long and the short of it is that we run Docker containers (not only
> Jupyter, though that's currently one of the defaults) with
> computational environments and "inject" data through a handcrafted
> FUSE filesystem.
>
> The ultimate location of the data is not important (it can be either
> local or remote), as long as you provide a valid URI containing both
> the location and the transfer protocol (e.g. 'http://example.com/file',
> 'globus:/endpoint/foo/bar'). There are a couple of additional
> attributes you need to provide (size and name, although over HTTP we
> can sometimes get these). We keep track of all of this in an external
> database (MongoDB via Girder), which the FUSE layer then uses to
> resolve OS-level I/O calls into appropriate requests for data. For
> example, when you open() a file that's registered as an 'http://' URL,
> it will (invisibly) cache it locally and present it as though it were
> local.
>
> Kacper Kowalik, our software architect, recently gave a presentation on
> it that might be of interest: http://use.yt/upload/c8236396 .
>
> I'd be happy to share more here or offline, too. This is something
> we're working on pretty hard, and while we have a ways to go,
> especially in smoothing things out from a UI/UX perspective and
> stabilizing the platform, we really want to engage much more deeply
> with folks throughout the community.
>
> -Matt, on behalf of the Whole Tale team
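To make the data-registration idea above concrete, here is a toy sketch of
the pattern Matt describes (this is NOT Whole Tale's actual code; the real
system keeps the records in MongoDB via Girder and resolves them through a
FUSE layer, whereas the cache directory and resolve() helper below are
purely illustrative):

    import os
    import shutil
    import urllib.request

    CACHE_DIR = "/tmp/wt-cache"  # hypothetical local cache location

    # The kind of record that gets registered: a URI plus name and size.
    entry = {
        "uri": "http://example.com/file",
        "name": "file",
        "size": 1048576,  # bytes; sometimes discoverable over HTTP
    }

    def resolve(entry):
        """Return a local path for a registered entry, caching http:// URIs."""
        os.makedirs(CACHE_DIR, exist_ok=True)
        local_path = os.path.join(CACHE_DIR, entry["name"])
        if not os.path.exists(local_path):
            with urllib.request.urlopen(entry["uri"]) as response, \
                    open(local_path, "wb") as out:
                shutil.copyfileobj(response, out)
        return local_path

    # In Whole Tale the FUSE layer does this invisibly at open() time;
    # here the caller has to ask for the resolved path explicitly.
    with open(resolve(entry), "rb") as f:
        first_bytes = f.read(64)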
> On Thu, Mar 15, 2018 at 1:06 PM, 'Aaron Watters' via Project Jupyter
> <[email protected]> wrote:
> > Hi folks,
> >
> > I'm interested in techniques for sharing data in scientific workflows.
> > Tools like git/github and docker/repo2docker are great for sharing
> > computational environments and moderately sized data, but not good for
> > sharing (say) hundreds of gigabytes of data. What do people do?
> >
> > I have in mind something like this: a scientist on a good network
> > spins up a Jupyter server in a Docker container containing a workflow,
> > using github and repo2docker. In the container s/he provides some
> > authorization credentials, and data for the workflow appears in the
> > container if the credentials are valid, maybe with read/write access
> > of some sort if the credentials are really good.
> >
> > If we are interested in providing publicly accessible data in
> > read-only mode, we could just dump the data to a web server anywhere
> > and pull it down using HTTP.
> >
> > I don't know the right way to do this if we want to have limited
> > access to the data and sometimes provide the ability to write it.
> >
> > I'm also interested in the case where the scientist is remote, i.e.
> > certain people are allowed to use our compute cluster, possibly with
> > data they have locally or with other data out there somewhere...
> >
> > Any and all thoughts or pointers appreciated. Thanks!
> > Sorry if the question is silly or too vague.
> >
> > -- Aaron Watters
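For the public, read-only case above, dumping the files behind a web server
and pulling them over HTTP really is about as simple as it sounds; here is a
placeholder sketch (the URL, filename, and DATA_TOKEN environment variable
are all hypothetical), with an Authorization header thrown in as one common
way to handle the limited-access case:

    import os
    import shutil
    import urllib.request

    DATA_URL = "https://data.example.org/dataset.tar.gz"  # placeholder
    TOKEN = os.environ.get("DATA_TOKEN", "")  # injected into the container

    # Send a bearer token only if one was provided; public data needs none.
    request = urllib.request.Request(
        DATA_URL,
        headers={"Authorization": "Bearer " + TOKEN} if TOKEN else {},
    )
    # Stream to disk rather than reading into memory, since the data may
    # be hundreds of gigabytes.
    with urllib.request.urlopen(request) as response, \
            open("dataset.tar.gz", "wb") as out:
        shutil.copyfileobj(response, out)

Writable or restricted access needs more machinery than this, which is
exactly the gap the Whole Tale work above is aimed at.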
