On 09/02/2018 18:13, Ludovic Courtès wrote:

Amirouche Boubekki <amirou...@hypermove.net> writes:

tl;dr: Distribution of data and software seems similar.
        Data is more and more important in software and reproducible
        science. The data science ecosystem lacks resource sharing.
        I think guix can help.

Now, whether Guix is the right tool to distribute data, I don’t know.
Distributing large amounts of data is a job in itself, and the store
isn’t designed for that.  It could quickly become a bottleneck.  That’s
one of the reasons why the Guix Workflow Language (GWL) does not store
scientific data in the store itself.

I'd say it depends on the data and how it is used inside and outside of a workflow. Some data could very well be stored in the store, and then distributed via standard channels (Zenodo, ...) after export by "guix pack". For big datasets, some other mechanism is required.

I think it's worth thinking carefully about how to exploit guix for reproducible computations. As Lispers know very well, code is data and data is code. Building a package is a computation like any other. Scientific workflows could be handled by a specific build system. In fact, as long as no big datasets or multiple processors are involved, we can do this right now, using standard package declarations.

It would be nice if big datasets could conceptually be handled in the same way while being stored elsewhere - a bit like git-annex does for git. And for parallel computing, we could have special build daemons.
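
To make the git-annex analogy a bit more concrete, here is a minimal sketch in Python of what "handled in the same way while being stored elsewhere" could mean: the store would keep only a small pointer (hash and size), and whatever is fetched from the external location is checked against it. This is not how Guix or git-annex are implemented; the names make_pointer, verify and demo are made up for illustration, and the "dataset" is just a temporary file.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_pointer(path: Path) -> dict:
    """Build the small record that would be kept instead of the data itself.

    Hypothetical format: only the content hash and the size in bytes.
    """
    return {"sha256": sha256_of(path), "size": path.stat().st_size}


def verify(path: Path, pointer: dict) -> bool:
    """Check that a file obtained from elsewhere matches its pointer."""
    return (path.stat().st_size == pointer["size"]
            and sha256_of(path) == pointer["sha256"])


def demo() -> tuple:
    """Create a stand-in dataset, record its pointer, then modify the data."""
    with tempfile.TemporaryDirectory() as tmp:
        dataset = Path(tmp) / "dataset.bin"
        dataset.write_bytes(b"measurements" * 1000)
        pointer = make_pointer(dataset)
        ok_before = verify(dataset, pointer)
        # Simulate a corrupted or substituted download.
        dataset.write_bytes(b"tampered" * 1000)
        ok_after = verify(dataset, pointer)
    return ok_before, ok_after


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is only that the pointer is tiny and fully determines the content, so a workflow could depend on it for reproducibility regardless of where the bytes actually live.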
