On Wed, May 08, 2013 at 10:47:14PM +0300, Oleg Goldshmidt wrote:
> Elazar Leibovich <elaz...@gmail.com> writes:
>
> > Hi,
> >
> > I have a software product being built a few times a day (continuous
> > integration style). The end product is an installable tar.gz with many
> > java jars.
> >
> > Since the content of the tar.gz's is mostly the same, I want to use a
> > filesystem that would dedupe the duplicated content.
> >
> > As I see it, it's s FUSE filesystem that:
> >
> > 1. When a file with .tar.gz extension stored, it untar it and store it
> >    in a folder (keeping the file order in a list).
> > 2. When it is read again, it will tar gz the underlying folder, and
> >    will give the gzip'd result.
> > 3. It will keep a list of file hashes, and would replace the file with
> >    a symlink to another file if possible.
> > 4. Bonus: do the same for jars. Java is linked at runtime, so if a
> >    .java file didn't change - neither does its class.
> >
> > Is there anything like that available?
> > Is there a smarter solution?
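
For reference, step 3 of the proposal quoted above (keep a list of file hashes and replace a duplicate with a symlink) can be sketched in a few lines of Python. This is only an illustration of the idea on an already-unpacked tree, not a FUSE filesystem and not how any particular tool implements it; the names file_digest and dedupe_tree are made up for this sketch, and the choice of SHA-256 is an assumption.

```python
import hashlib
import os
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def dedupe_tree(root: Path) -> int:
    """Replace duplicate regular files under root with relative symlinks.

    The first file seen with a given digest (in sorted path order) is kept;
    later files with the same digest become symlinks to it.
    Returns the number of files replaced. (Hypothetical helper.)
    """
    first_seen: dict[str, Path] = {}
    replaced = 0
    # Materialise the listing first, since the tree is modified while looping.
    for path in sorted(root.rglob("*")):
        if path.is_symlink() or not path.is_file():
            continue
        digest = file_digest(path)
        original = first_seen.setdefault(digest, path)
        if original is path:
            continue  # first occurrence of this content, keep it
        # A relative target keeps the tree relocatable.
        target = os.path.relpath(original, start=path.parent)
        path.unlink()
        path.symlink_to(target)
        replaced += 1
    return replaced
```

Calling dedupe_tree(Path("unpacked-build")) on a hypothetical directory of unpacked builds would leave one real copy of each distinct file and symlinks everywhere else. Note that it compares digests only and discards the duplicate's own mode and timestamps, which matters if the tree has to be packed back into an identical tar.gz.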

Can you afford a periodic scan by some service? I figure you could
always trigger it with inotify otherwise, but there is an overhead.

http://dedup.debian.net gives the following advice, which I have not
yet tested:

  # Replace duplicate files with symlinks
  rdfind -outputname /dev/null -makesymlinks true debian/mypackage/
  # Fix those symlinks to make them relative
  symlinks -r -s -c debian/mypackage/

> 1. I would probably look into using a version control system rather than
>    a filesystem.
>
>    a) Modern version control systems are often/usually capable of
>       storing binary diffs between revisions. Frankly, I've never looked
>       at how git or mercurial do that (probably quite well), but even,
>       say, SVN should be able to store a binary diff on commit. IIRC SVN
>       diffs using xdelta or similar.
>

Git stores files. It should handle such deduping by design. But this
happens in Git's storage, and not in the actual filesystem:

  tzafrir@pungenday:/tmp/git-test$ git init
  Initialized empty Git repository in /tmp/git-test/.git/
  tzaf...@debian.org
  tzafrir@pungenday:/tmp/git-test$ dd if=/dev/urandom bs=1024 count=1024 of=rand
  1024+0 records in
  1024+0 records out
  1048576 bytes (1.0 MB) copied, 0.0832973 s, 12.6 MB/s
  tzafrir@pungenday:/tmp/git-test$ du -s .git .
  92      .git
  1028    .
  tzafrir@pungenday:/tmp/git-test$ git add rand
  tzafrir@pungenday:/tmp/git-test$ git commit -m "rand"
  [master (root-commit) 401d035] rand
   1 file changed, 0 insertions(+), 0 deletions(-)
   create mode 100644 rand
  tzafrir@pungenday:/tmp/git-test(master)$ du -s .git .
  1172    .git
  1028    .
  tzafrir@pungenday:/tmp/git-test(master)$ cp rand rand1
  tzafrir@pungenday:/tmp/git-test(master)$ git add rand1
  tzafrir@pungenday:/tmp/git-test(master)$ git commit -m "rand1"
  [master a4d084f] rand1
   1 file changed, 0 insertions(+), 0 deletions(-)
   create mode 100644 rand1
  tzafrir@pungenday:/tmp/git-test(master)$ du -s .git .
  1188    .git
  2052    .
  tzafrir@pungenday:/tmp/git-test(master)$ git ls-tree HEAD
  100644 blob e5ad63a5eb4a806ae572e977742cdce9e9f74cfa    rand
  100644 blob e5ad63a5eb4a806ae572e977742cdce9e9f74cfa    rand1

There are a number of backup systems / schemes that aim to provide file
de-duplication. At least some of them use Git.

-- 
               Tzafrir Cohen         | tzaf...@jabber.org | VIM is
http://tzafrir.org.il                |                    | a Mutt's
tzaf...@cohens.org.il                |                    | best
tzaf...@debian.org                   |                    | friend

_______________________________________________
Linux-il mailing list
Linux-il@cs.huji.ac.il
http://mailman.cs.huji.ac.il/mailman/listinfo/linux-il
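
A footnote on the Git transcript above: rand and rand1 show the same blob ID in the git ls-tree output because Git names a blob by a hash of its content, so two files with identical bytes map to one stored object. The sketch below reproduces that ID calculation in Python, assuming a repository that uses Git's default SHA-1 object format; git_blob_id is a name made up for this example, and the code only computes the ID, it does not write anything into a repository.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object ID Git assigns to a blob with this content.

    Assumes the SHA-1 object format: the ID is the SHA-1 of the header
    "blob <size>" followed by a NUL byte and then the raw content.
    """
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# Identical content gives an identical ID regardless of the file name,
# which is why a copy of a file costs almost nothing in .git.
data = b"the same bytes stored under two names\n"
assert git_blob_id(data) == git_blob_id(bytes(data))

# Any change in content gives a different ID, hence a separate object.
assert git_blob_id(data) != git_blob_id(data + b"x")
```

The result should match what git hash-object prints for a file with the same bytes. The file name and mode live in the tree object, not in the blob, so they play no part in the ID.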