There was a guest post on the Jupyter blog the other day about Quilt, which may be interesting for you to look at: https://blog.jupyter.org/reproducible-data-dependencies-for-python-guest-post-d0f68293a99
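For anyone who hasn't tried it, the workflow that post describes looks roughly
like this (a sketch from memory of the quilt 2.x docs; the package name and
node layout follow their iris example and may differ from what the post uses):

    # shell, once:
    #   pip install quilt
    #   quilt install uciml/iris   # fetch a versioned data package

    from quilt.data.uciml import iris  # data packages import like code

    df = iris.tables.iris()  # materializes the data as a pandas DataFrame
    print(df.head())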
Jason

On Thu, Mar 15, 2018 at 12:34 PM Matthew Turk <[email protected]> wrote:
> Hi!
>
> Great question. My name's Matt Turk, and along with some other folks
> (lurking?) on this list I work on a project called Whole Tale. We just
> had an overview paper published (gold OA) at
> https://doi.org/10.1016/j.future.2017.12.029 that gives some
> architectural information, but the gist is that we're trying to solve
> exactly that problem. Our website isn't the best, and we don't expect a
> stable, running instance until early summer (I bet if you logged in you
> could find ways to break it, or prickly bits in the UI), but you can
> find a bit more at wholetale.org and github.com/whole-tale . You could
> even launch your own instance, should you want to.
>
> The long and the short of it is that we run Docker containers (not only
> Jupyter, though that's currently one of the defaults) with
> computational environments and "inject" data through a handcrafted
> FUSE filesystem.
>
> The ultimate location of the data is not important (it can be either
> local or remote), as long as you provide a valid URI containing both
> the location and the transfer protocol (e.g. 'http://example.com/file',
> 'globus:/endpoint/foo/bar'). There are a couple of additional
> attributes you need to provide (size and name, although over HTTP we
> can sometimes get these). We keep track of all of this in an external
> database (MongoDB via Girder), which the FUSE layer then uses to
> resolve OS-level I/O calls into appropriate requests for data. For
> example, when you open() a file that's registered as an 'http://' URL,
> it will (invisibly) cache it locally and present it as though it were
> local.
>
> Kacper Kowalik, our software architect, recently gave a presentation on
> it that might be of interest: http://use.yt/upload/c8236396 .
>
> I'd be happy to share more here or offline, too. This is something
> we're working on pretty hard, and while we have a ways to go,
> especially in smoothing things out from a UI/UX perspective and
> stabilizing the platform, we really want to engage much more deeply
> with folks throughout the community.
>
> -Matt, on behalf of the Whole Tale team
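To make the data-registration idea above concrete, here is a toy sketch of
the pattern Matt describes (this is NOT Whole Tale's actual code; the real
system keeps the records in MongoDB via Girder and resolves them through a
FUSE layer, whereas the cache directory and resolve() helper below are
purely illustrative):

    import os
    import shutil
    import urllib.request

    CACHE_DIR = "/tmp/wt-cache"  # hypothetical local cache location

    # The kind of record that gets registered: a URI plus name and size.
    entry = {
        "uri": "http://example.com/file",
        "name": "file",
        "size": 1048576,  # bytes; sometimes discoverable over HTTP
    }

    def resolve(entry):
        """Return a local path for a registered entry, caching http:// URIs."""
        os.makedirs(CACHE_DIR, exist_ok=True)
        local_path = os.path.join(CACHE_DIR, entry["name"])
        if not os.path.exists(local_path):
            with urllib.request.urlopen(entry["uri"]) as response, \
                    open(local_path, "wb") as out:
                shutil.copyfileobj(response, out)
        return local_path

    # In Whole Tale the FUSE layer does this invisibly at open() time;
    # here the caller has to ask for the resolved path explicitly.
    with open(resolve(entry), "rb") as f:
        first_bytes = f.read(64)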
> On Thu, Mar 15, 2018 at 1:06 PM, 'Aaron Watters' via Project Jupyter
> <[email protected]> wrote:
> > Hi folks,
> >
> > I'm interested in techniques for sharing data in scientific workflows.
> > Tools like git/github and docker/repo2docker are great for sharing
> > computational environments and moderately sized data, but not good for
> > sharing (say) hundreds of gigabytes of data. What do people do?
> >
> > I have in mind something like this: a scientist on a good network
> > spins up a Jupyter server in a Docker container containing a workflow,
> > using github and repo2docker. In the container s/he provides some
> > authorization credentials, and data for the workflow appears in the
> > container if the credentials are valid, maybe with read/write access
> > of some sort if the credentials are really good.
> >
> > If we are interested in providing publicly accessible data in
> > read-only mode, we could just dump the data to a web server anywhere
> > and pull it down using HTTP.
> >
> > I don't know the right way to do this if we want to have limited
> > access to the data and sometimes provide the ability to write it.
> >
> > I'm also interested in the case where the scientist is remote, i.e.
> > certain people are allowed to use our compute cluster, possibly with
> > data they have locally or with other data out there somewhere...
> >
> > Any and all thoughts or pointers appreciated. Thanks!
> > Sorry if the question is silly or too vague.
> >
> > -- Aaron Watters
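For the public, read-only case above, dumping the files behind a web server
and pulling them over HTTP really is about as simple as it sounds; here is a
placeholder sketch (the URL, filename, and DATA_TOKEN environment variable
are all hypothetical), with an Authorization header thrown in as one common
way to handle the limited-access case:

    import os
    import shutil
    import urllib.request

    DATA_URL = "https://data.example.org/dataset.tar.gz"  # placeholder
    TOKEN = os.environ.get("DATA_TOKEN", "")  # injected into the container

    # Send a bearer token only if one was provided; public data needs none.
    request = urllib.request.Request(
        DATA_URL,
        headers={"Authorization": "Bearer " + TOKEN} if TOKEN else {},
    )
    # Stream to disk rather than reading into memory, since the data may
    # be hundreds of gigabytes.
    with urllib.request.urlopen(request) as response, \
            open("dataset.tar.gz", "wb") as out:
        shutil.copyfileobj(response, out)

Writable or restricted access needs more machinery than this, which is
exactly the gap the Whole Tale work above is aimed at.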
