On Fri, 04 Sep 2020, Steffen Möller wrote:
> * sharing data between colleagues - can you have two different versions
> at the same time?

Sure, similar to git... well -- it is git ;) So multiple versions across
collaborators, multiple versions on your own box, etc. -- all possible.
Once you really get into it, you might even like to start using BTRFS as
your filesystem -- it provides an awesome CoW feature, so you could breed
your huge datasets without wasting too much space.

Re versions: especially mind-blowing is the ability to quickly switch
between versions -- "large" files are just symlinks. The only gotcha
remaining: switching between versions of a dataset with subdatasets is
not yet "convenienced", but it is possible to have multiple clones of the
dataset hierarchy at different versions.

> * I see this mostly orthogonal to the question how we organize our data
> relative to whatever "dataRoot" we define

Well -- you could have disjoint datasets; it is not required to bring them
all up into a superdataset, although that could have benefits.

> * we still have a community-effort to collect the data from somewhere
> (which likely is not a git repository) and post-process it (like some
> indexing for a variety of tools) and to finally prepare the data
> somewhere for "processing"

Check out "datalad run" and the datalad-container extension, which
provides "datalad containers-run". Then you could have your preprocessing
entirely reproducible, with simple provenance recorded within git commits.
Handbook on that:
http://handbook.datalad.org/en/latest/basics/basics-run.html

And Michael ATM is actively looking into making snakemake tolerate
datalad (well, git-annex), so you might like to define your snakemake
workflows.

> * with some agreement between us on how to formulate the metadata in a
> machine-readable manner so we know what tool needs to check out what
> files for which workflows

Unfortunately, I cannot recommend anything specific ATM since I am not
familiar with bioinformatics metadata and its use within workflows.
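
To make the "large files are just symlinks" point a bit more concrete,
here is a toy Python sketch (standard library only). It is NOT how
git-annex or DataLad is implemented; the names store_content, checkout,
and demo are made up for illustration, and it assumes a POSIX filesystem
with symlink support. The idea it shows: file content is stored once
under a content-derived key, and switching versions only repoints a
symlink, so no data is copied.

```python
import hashlib
import os
import tempfile


def store_content(store_dir, data):
    """Save data under its SHA-256 digest and return the stored path."""
    key = hashlib.sha256(data).hexdigest()
    path = os.path.join(store_dir, key)
    if not os.path.exists(path):  # identical content is stored only once
        with open(path, "wb") as f:
            f.write(data)
    return path


def checkout(link_path, stored_path):
    """Point link_path at stored_path, replacing any existing link."""
    tmp = link_path + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(stored_path, tmp)
    os.replace(tmp, link_path)  # rename over the old link (atomic on POSIX)


def demo():
    """Create two versions of one file and switch between them."""
    root = tempfile.mkdtemp()
    store = os.path.join(root, "store")
    os.mkdir(store)
    link = os.path.join(root, "big_file.dat")

    v1 = store_content(store, b"version 1 of a large file")
    v2 = store_content(store, b"version 2 of a large file")

    seen = []
    for version in (v1, v2, v1):  # switch forth and back again
        checkout(link, version)
        with open(link, "rb") as f:
            seen.append(f.read())
    return seen, sorted(os.listdir(store))


if __name__ == "__main__":
    contents, stored_keys = demo()
    print(contents)
    print(len(stored_keys), "objects in the store")
```

Switching back to the first version reuses the object already in the
store, which is why such a switch is cheap regardless of file size.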

> I should now read a bit in your handbook. And think a bit more about it
> over the weekend.

I hope you like it. Adina and Michael did (well -- are still doing) an
awesome job with it.

-- 
Yaroslav O. Halchenko
Center for Open Neuroscience     http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
WWW:   http://www.linkedin.com/in/yarik