Bug#761117: debsources: file-level deduplication
On Thu, Sep 11, 2014 at 02:09:35PM +0800, Paul Wise wrote:
> A hash based filesystem layout like we use on snapshot.d.o.
>
> Use a filesystem with deduplication support like btrfs.

I thought about btrfs back in the day, and ruled out the idea because it
imposes a fairly important deployment requirement.

Regarding a hash-based filesystem layout, that will get in the way of
dpkg-source -x, meaning you will need to "massage" the files into the
hash layout after package extraction. Plus, you lose the ability to use
the natural file organization as the URL structure that you present to
the user.

All in all, offline deduplication seems much more appealing and, up to
now, it seems to have no drawbacks whatsoever (except a negligible lag
between the extraction time and the deduplication run).

Cheers.
-- 
Stefano Zacchiroli, z...@upsilon.cc
Maître de conférences, http://upsilon.cc/zack
Former Debian Project Leader, @zack on identi.ca
« the first rule of tautology club is the first rule of tautology club »
Bug#761117: debsources: file-level deduplication
On Thu, Sep 11, 2014 at 4:31 AM, Stefano Zacchiroli wrote:
> We already have all the file checksums in the database. Removing
> (file-level) duplication in the file storage, using hard-links, can be
> safely implemented offline, i.e., as long as no debsources update is
> ongoing.

I missed the talk, but some other ideas:

A hash based filesystem layout like we use on snapshot.d.o.

Use a filesystem with deduplication support like btrfs.

-- 
bye,
pabs

https://wiki.debian.org/PaulWise
Bug#761117: debsources: file-level deduplication
Package: qa.debian.org
Severity: wishlist

We already have all the file checksums in the database. Removing
(file-level) duplication in the file storage, using hard-links, can be
safely implemented offline, i.e., as long as no debsources update is
ongoing.

Micro-benchmark (from my DebConf14 Debsources talk) of the expected disk
space saving:

  select count(*) from checksums;
  -> 35'370'653
  select count(distinct sha256) from checksums;
  -> 15'822'745
  -- => deduplicated core: ~45%

Cheers.
-- 
Stefano Zacchiroli
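
To make the hard-link proposal concrete, here is a minimal sketch of an offline deduplication pass. It is an illustration only, not debsources code: the function names `sha256_of` and `dedup_with_hardlinks` and the `.dedup-tmp` suffix are made up for this example. It also recomputes SHA-256 from the files so that it is self-contained, whereas the proposal above would reuse the checksums already stored in the database. Like the proposal, it assumes that nothing else is writing to the tree while it runs.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedup_with_hardlinks(root):
    """Hard-link together regular files under root with equal SHA-256.

    Hypothetical helper for illustration. Returns the number of files
    that were replaced by a hard link to an identical file.
    """
    # Group every regular file under root by its content hash.
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            by_hash[sha256_of(path)].append(path)

    replaced = 0
    for paths in by_hash.values():
        paths.sort()
        keeper = paths[0]
        keeper_stat = os.stat(keeper)
        for duplicate in paths[1:]:
            dup_stat = os.stat(duplicate)
            # Hard links cannot cross filesystems.
            if dup_stat.st_dev != keeper_stat.st_dev:
                continue
            # Already the same inode: nothing to do.
            if dup_stat.st_ino == keeper_stat.st_ino:
                continue
            # Link under a temporary name, then rename over the duplicate,
            # so the original path never disappears.
            tmp = duplicate + ".dedup-tmp"
            os.link(keeper, tmp)
            os.replace(tmp, duplicate)
            replaced += 1
    return replaced
```

Calling `dedup_with_hardlinks("/path/to/sources")` (path hypothetical) returns how many files were relinked; a second run over the same tree should return 0. Two caveats of the technique itself: hard-linked files share one inode, so permissions and timestamps are merged along with the content, and any later in-place write to one path changes all of them.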