On Thu, 28 Nov 2013 02:10:42 -0800 (PST) Mario Wohlwender <mario.wohlwen...@torqeedo.com> wrote:
> Hi git users,
>
> I want to use a git-Repository to handle a folder with lot of
> documents (.pdf...) needed for manufacturing of my company's products
> (technical drawings, Gerber-Files for PCB Manufacturing and so on).
>
> Because the data is mainly binary (.pdf) and the size of the Folder
> is very large (>1GB), I thought of a workflow git is probably not made
> for (to avoid everybody who wants to change something has to clone
> the repository).
> The basic idea is to have a work area with a git repository on the
> company's network. To this work area (directory) all employees have
> read access and only the development department has write access. To
> have an history of who changed what and when I would like to use git
> on this folder.
> Is it a problem when more than one client (developer) makes commits
> to this one workarea/repository? The access to the repository is by
> file protocol, so the git-commands would be executed on the local
> computer of the developer.
> Is the git-Repository locked while one command is active? Or can the
> database get corrupted when two developers start commands
> simultaneously on the repository?
>
> Would be nice to hear your opinion about this workflow.

I'm afraid it's not gonna work, and here's why.

While Git locks access to its own database (the ".git" directory inside
the repository's work tree) using a lock file, this only keeps other
Git instances from messing with the database while one of them is
active. In contrast, access to the files in the work tree is
inherently racy: I don't know offhand whether Git flock()s the files in
the work tree while reading them, but even if it does, some file
operations are inherently non-atomic and, further, flock() is only
advisory. Hence a situation like this one might well occur:

1) Developer A updates a file foo.pdf and decides to `git add` it.
2) In the meantime developer B decides to update that same foo.pdf, so
   B's file copying program and A's Git access the file at the same
   time.

What happens then, I don't know. A good file manager would first upload
the new file under a temporary name, fsync() its contents, then delete
the old file and rename the new one to the old one's name. This leaves
a tiny window in which `git add` finds no file under the old name;
`git add` would then fail, but that's mostly OK because the error is
visible. A bad file manager, however, might upload the new contents by
truncating the old file and writing the new data over it. That
obviously gives `git add` a non-zero chance of reading an incomplete
file.

What I mean is that in the "normal" course of operation (where each
developer has their private repository) Git assumes the developers
themselves make sure their repositories are not subject to races like
the one I outlined above, and you're likely to break this assumption.
Git itself does not support branch or file locks, and its design
explicitly goes against this, so you can't do that.

With this in mind, I might propose these solutions:

* Make several Git repositories, grouping the files they manage by
  type or some other criterion. This would allow the "normal"
  workflow, where each developer has a set of personal clones, and the
  disk space requirements should then be tolerable. You'll then have
  to invent a scheme for updating the read-only file share whenever a
  push to the appropriate central repository is done.

  One modification to this is to have a shared (centralized) huge
  repository which contains everything, but with each kind of data kept
  on its own branch, plus a set of repositories each of which hosts
  just one branch of that data. Each developer might then clone only a
  "single-branch" repo when needed, but they have to push their updated
  branch to both repos.
  Or, alternatively, post-update hooks in those "slim" repos would do
  that themselves. While common sense might tell you this is not gonna
  work, Git only cares about commits, trees and blobs, and it's
  perfectly OK to have the same graph of commits in different repos.

* Use Subversion: it allows you to organize your data as a set of
  directories and check out only a single directory to update it.

I don't know your setup precisely, but I think Subversion fits your
model better than Git: it appears you don't need Git's killer features
like cheap branching, rebasing, stashing, per-hunk staging, resetting
etc. It looks like what you need is just something which is updated
linearly with ready-made binary data (PDFs and all).

So maybe the way to go is to use a powerful DVCS (Git) to manage the
*sources* for those PDFs and other stuff, using different repositories
for different projects, and Subversion to keep a set of everything
ready to be consumed by downstream users (non-developers).
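To make the "good file manager" vs. "bad file manager" distinction from
the race scenario more concrete, here is a minimal Python sketch of the
careful way to update a file. It is only an illustration under
assumptions: the function name replace_file_atomically is made up, and
instead of "delete the old file, then rename" it renames the new file
directly over the old name with os.replace(), which closes the window
where no file exists under that name. On a local POSIX filesystem that
rename is atomic; on a network share the guarantees may well be weaker,
so don't read this as a fix for the shared work area.

```python
import os
import tempfile


def replace_file_atomically(target_path: str, data: bytes) -> None:
    """Write data to a temporary file, flush it to disk, then rename it over target_path.

    Hypothetical helper for illustration; not part of Git or any file manager.
    """
    directory = os.path.dirname(os.path.abspath(target_path))
    # Create the temporary file in the same directory (hence the same
    # filesystem), so the final step is a rename and not a copy.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Ask the OS to push the contents to stable storage first.
            os.fsync(tmp.fileno())
        # A reader opening target_path sees either the complete old file
        # or the complete new one, never a half-written mix.
        os.replace(tmp_path, target_path)
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The "bad" variant would simply be open(target_path, "wb") followed by
write(): the file is truncated first, so a concurrent `git add` can
read an empty or partial foo.pdf without any error being reported.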