Amirouche Boubekki <[email protected]> writes:

> Then, in a follow up mail, you reply to Konrad:
>
>>> Konrad Hinsen <[email protected]> skribis:
>>
>> [...]
>>
>>> It would be nice if big datasets could conceptually be handled in the
>>> same way while being stored elsewhere - a bit like git-annex does for
>>> git. And for parallel computing, we could have special build daemons.
>>
>> Exactly.  I think we need a git-annex/git-lfs-like tool for the store.
>> (It could also be useful for things like secrets, which we don’t want
>> to have in the store.)
In addition to the answers by Ludo and Roel, I’d like to add that for
data there is more that we’d like to know.  For any given dataset in
storage I’d like to know how it relates to previous versions of the
same dataset.  The hash alone is not sufficient: I actually need to
know which dataset is the parent and which is the child.  The store
does not give me relations like that when given two or more items.  It
retains information about links between items within one generation
(if they embed such references), but not across generations.

I think the requirements for the storage and retrieval of (big)
datasets are very different from those of software packages.  There
are projects dedicated to dataset storage, such as Pachyderm.io.
Since data storage is just a stepping stone to better workflows,
Pachyderm also includes support for application bundles, but it may be
better to let a dedicated workflow language take care of the
application side.  Maybe the GWL could be integrated with dedicated
data storage solutions like Pachyderm.

-- 
Ricardo

GPG: BCA6 89B6 3655 3801 C3C6 2150 197A 5888 235F ACAC
https://elephly.net
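P.S.: To make the parent/child point a little more concrete, here is a
minimal sketch (hypothetical names, not an existing tool) of a
content-addressed store that records a parent link next to each item.
With hashes alone, the directed "previous version of" relation below
could not be recovered from the store:

```python
# Hypothetical sketch: a content-addressed dataset store that, unlike
# the Guix store, records a parent link for each item so that lineage
# across "generations" of a dataset can be queried.
import hashlib


def content_hash(data):
    """Return the SHA-256 hex digest of the raw bytes."""
    return hashlib.sha256(data).hexdigest()


class DatasetStore:
    def __init__(self):
        self.blobs = {}    # hash -> data
        self.parents = {}  # hash -> parent hash, or None for a root

    def add(self, data, parent=None):
        """Store DATA; record PARENT (a hash) as its predecessor."""
        h = content_hash(data)
        self.blobs[h] = data
        self.parents[h] = parent
        return h

    def lineage(self, h):
        """Return the chain of hashes from H back to its root."""
        chain = []
        while h is not None:
            chain.append(h)
            h = self.parents[h]
        return chain

    def is_ancestor(self, older, newer):
        """True if OLDER is a previous version of NEWER."""
        return older in self.lineage(newer)


store = DatasetStore()
v1 = store.add(b"raw measurements")
v2 = store.add(b"cleaned measurements", parent=v1)
assert store.is_ancestor(v1, v2)      # v1 is a previous version of v2
assert not store.is_ancestor(v2, v1)  # the relation is directed
```

This is essentially the commit-like metadata that git keeps next to
its content-addressed blobs; a dataset tool for the store would need
something similar.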
