Points of clarification...

On Fri, Jul 20, 2018 at 7:59 PM Carl Boettiger via discuss
<[email protected]> wrote:

> Workflow-wise, I find Git LFS very compelling, but in practice, I found it 
> not to be viable for public GitHub projects in which you expect forks and 
> PRs.  GitHub's pricing model basically means that Git LFS breaks the fork / 
> PR workflow (see 
> https://medium.com/@megastep/github-s-large-file-storage-is-no-panacea-for-open-source-quite-the-opposite-12c0e16a9a91)
>   You can set up a different source (i.e. GitLab) to host the LFS part and 
> still have your repo on GitHub, see https://github.com/jimhester/test-glfs/, 
> but this was sufficiently cumbersome that I could not get it to work.

We had a pretty top-notch engineer (Dean) dedicate serious effort to
our bespoke setup for this at Gigantum. Agreed - it is non-trivial to
leave behind GitHub for an inexpensive provider. I haven't checked in
with the various "enterprise" providers. If folks have specific
interest / questions, I can bug Dean about it.

> I have not experimented with Git Annex or Dat, but my understanding is that 
> while these provide version control solution, they do not provide a 
> file-storage solution.  Dat is a peer-to-peer model, which I believe means 
> you need some 'peer' server always on and running somewhere when you want to 
> access your data.  My own need is almost the inverse of this problem -- I am 
> primarily looking for a mechanism to easily share data associated with a 
> project that already lives on GitHub (possibly public, possibly private), and 
> I want a way to give collaborators / students access to both download and 
> upload the data without asking them to adopt a workflow of tools that is any 
> more complicated than it needs to be.  e.g. sticking the data on Amazon S3 is 
> often good enough -- I can version data linearly with file names, I do not 
> need git merge capabilities -- but this does impose a significant overhead 
> for new users with needing to use aws cli or similar and set up more 
> authentication tokens.   A small barrier but enough to discourage 
> collaborators.

With specific regard to Git Annex: it does provide easy backup to a
variety of providers (S3, Backblaze, rsync, ... see
http://git-annex.branchable.com/special_remotes/). It will even do
crazy things like "drop the file locally if at least 2 copies exist in
trusted repositories." You can also use Git Annex to track data that's
already backed up elsewhere (e.g., at a URL you trust), and it will
still checksum and verify the content when you get a copy.
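To make that concrete, here's a rough sketch of the commands involved
(the remote name, bucket, filenames, and URL are placeholders, and the
S3 remote obviously needs credentials configured -- check `git annex
help` and the special_remotes page for the options your version
supports):

```shell
# Initialize annex in an existing git repo
git annex init "my laptop"

# Configure an S3 special remote (name and bucket are placeholders)
git annex initremote mys3 type=S3 encryption=none bucket=my-data-bucket

# Track a large file and push its content to the remote
git annex add big-dataset.csv
git commit -m "add dataset"
git annex copy big-dataset.csv --to=mys3

# Require at least 2 copies; drop the local copy only if that holds
git annex numcopies 2
git annex drop big-dataset.csv

# Track data that already lives at a URL you trust; git-annex records
# the URL and checksums the content when you later "git annex get" it
git annex addurl https://example.org/data.csv
```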

My understanding is that Dat is a bit more like BitTorrent. You can
host content for as long as you like and drop it when you want. But
just as with BitTorrent, it's not hard to set up a dedicated server
that will always host the content you care about.

These details remind me of another point: no matter what choice you
make, the chances that it's a permanent solution seem slim -- even
with something as flexible as Git Annex. So part of the thinking is:
what's your timeline for archival (assuming no one is finding value in
the data at the moment), and how easy would it be to transition to
something else? I'd argue that filesystem-inclusive solutions, or
systems built on highly standardized APIs (e.g. rsync, HTTP, SQL), are
the best in that regard.

D

------------------------------------------
The Carpentries: discuss
Permalink: 
https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M5c6f0a5f11ce5ae50994c6a9
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription
