I looked at datasets.datalad.org. I could well imagine using your technology for other (larger) databases like Pfam, UniProt or PDB. For cute little genomes, my initial reaction was to feel overwhelmed. Your pointer will certainly help to define what we want. Many thanks!

On 03.09.20 18:09, Yaroslav Halchenko wrote:
> You might like to listen to debconf20 talk on DataLad ;-)
>
> At some point I have started even to establish some kind of dh-datalad
> helper so that .deb package would contain a datalad dataset
> (git/git-annex repo), and would just `get` data files upon
> installation... So -- yes, they would not be "self contained" but it
> is infeasible for any sizeable data packages on debian. But they could
> be versioned, point to specific git state of corresponding datasets,
> provide lightweight and efficient upgrades (only changed/new files would
> need to be fetched), etc. They could be partitioned into smaller
> subdatasets or custom views to be provided, like we have
>
> https://github.com/datalad-datasets/hcp-structural-preprocessed
> which is a selection from a larger
> https://github.com/datalad-datasets/human-connectome-project-openaccess
>
> Never finished that helper though -- we just (develop and) use datalad
> directly and had no debian packages which would need strict dependency
> on the datasets. More of sample datasets could be found on
> https://datasets.datalad.org/ -- data primarily comes from original
> repositories, and covers now > 200TB
>
> We had started to collect resources someone might like to datalad'ify
> relevant to bioinformatics:
> https://github.com/datalad/datalad/milestone/14?closed=1
> but since we are not in bioinformatics field, never actually addressed
> them.
>
> I also know that https://github.com/notestaff is actively using
> git-annex (not sure if datalad -- but he did submit some issues, so he
> might) for bioinformatics. Might be worth checking with him
> if git-annex/datalad would be decided to be used.
>
> On Thu, 03 Sep 2020, Steffen Möller wrote:
>
>> Hello,
>> We are closing in on the workflows. What is kind of missing are the
>> mostly invariant inputs like the genomes of pathogens and very much so
>> the reference genomes of the human, mouse, rat, worm, fly, .... you name
>> them.
>> Other than a few years ago, hard drives are now big enough to
>> accommodate the one or other genome and derivative indexes. Just - I
>> don't think we want to organize in our regular Debian infrastructure
>> something as variant as public genome (yes, they are still regularly
>> updated, very much so) and that is so very security-irrelevant (just
>> some data). Also, different sites will vary a lot in where this data
>> shall be organized and all those scripts should likely be
>> executed/initiated as/by non-root. There are public sites for this from
>> where this data can be downloaded. Any redundancy to these sites imho
>> mostly hurts us. The other side is that to just get something up quickly
>> and for reproducibility tests, our infrastructure is difficult to beat.
>> Please kindly throw your ideas at me how you would like whole genomes to
>> be presented by Debian to the average user and to professionals. Just
>> reply to this thread and/or send me "+1"s a PM and I summarize this up
>> in a document which I suggest we then talk about in a jitsi meeting.