Amirouche Boubekki <[email protected]> writes:

> Then, in a follow up mail, you reply to Konrad:
>
>>> Konrad Hinsen <[email protected]> skribis:
>>
>> [...]
>>
>>> It would be nice if big datasets could conceptually be handled in the
>>> same way while being stored elsewhere - a bit like git-annex does for
>>> git. And for parallel computing, we could have special build daemons.
>>
>> Exactly.  I think we need a git-annex/git-lfs-like tool for the store.
>> (It could also be useful for things like secrets, which we don’t want
>> to have in the store.)
In addition to the answers by Ludo and Roel, I’d like to add that for
data there is more that we’d like to know.  For any given dataset in
storage I’d like to know how it relates to previous versions of the
same dataset.  The hash alone is not sufficient: I actually need to
know which dataset is the parent and which is the child.  The store
does not give me relations like that when given two or more items.  It
retains information about links between items within one generation
(if they embed such references), but not across generations.

I think the requirements for the storage and retrieval of (big)
datasets are very different from those of software packages.  There
are projects dedicated to dataset storage, such as Pachyderm.io.
Since data storage is just a stepping stone to better workflows,
Pachyderm also includes support for application bundles, but it may be
better to let a dedicated workflow language take care of the
application side.  Maybe the GWL could be integrated with dedicated
data storage solutions like Pachyderm.

-- 
Ricardo

GPG: BCA6 89B6 3655 3801 C3C6 2150 197A 5888 235F ACAC
https://elephly.net
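P.S.: To make the parent/child point a little more concrete, here is a
minimal sketch (hypothetical names, not an existing tool) of a
content-addressed store that records a parent link next to each item.
With hashes alone, the directed "previous version of" relation below
could not be recovered from the store:

```python
# Hypothetical sketch: a content-addressed dataset store that, unlike
# the Guix store, records a parent link for each item so that lineage
# across "generations" of a dataset can be queried.
import hashlib


def content_hash(data):
    """Return the SHA-256 hex digest of the raw bytes."""
    return hashlib.sha256(data).hexdigest()


class DatasetStore:
    def __init__(self):
        self.blobs = {}    # hash -> data
        self.parents = {}  # hash -> parent hash, or None for a root

    def add(self, data, parent=None):
        """Store DATA; record PARENT (a hash) as its predecessor."""
        h = content_hash(data)
        self.blobs[h] = data
        self.parents[h] = parent
        return h

    def lineage(self, h):
        """Return the chain of hashes from H back to its root."""
        chain = []
        while h is not None:
            chain.append(h)
            h = self.parents[h]
        return chain

    def is_ancestor(self, older, newer):
        """True if OLDER is a previous version of NEWER."""
        return older in self.lineage(newer)


store = DatasetStore()
v1 = store.add(b"raw measurements")
v2 = store.add(b"cleaned measurements", parent=v1)
assert store.is_ancestor(v1, v2)      # v1 is a previous version of v2
assert not store.is_ancestor(v2, v1)  # the relation is directed
```

This is essentially the commit-like metadata that git keeps next to
its content-addressed blobs; a dataset tool for the store would need
something similar.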
