Bug#761117: debsources: file-level deduplication

2014-09-11 Thread Stefano Zacchiroli
On Thu, Sep 11, 2014 at 02:09:35PM +0800, Paul Wise wrote:
> A hash based filesystem layout like we use on snapshot.d.o.
> 
> Use a filesystem with deduplication support like btrfs.

I thought about btrfs back in the day, and ruled the idea out because
it imposes a fairly heavy deployment requirement.

Regarding a hash-based filesystem layout, that will get in the way of
dpkg-source -x, meaning you will need to "massage" the files into the
hash layout after package extraction. Plus, you lose the ability to use
the natural file organization as the URL structure that you present to
the user.
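
To make the trade-off concrete, here is a minimal sketch of what a
hash-based layout could look like. The two-level fan-out and the names
hashed_path and sha256_of are assumptions made for illustration; this is
not necessarily how snapshot.d.o lays out its files.

```python
import hashlib
from pathlib import Path


def hashed_path(root: Path, sha256_hex: str) -> Path:
    """Map a SHA-256 hex digest to a path under root.

    Hypothetical layout: root/<first 2 hex chars>/<next 2>/<full digest>.
    """
    return root / sha256_hex[:2] / sha256_hex[2:4] / sha256_hex


def sha256_of(path: Path) -> str:
    """Hash a file in chunks, so large files are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

With such a layout the on-disk path depends only on the file content,
so the package/version/path hierarchy produced by dpkg-source -x is
gone from the filesystem and would have to be kept somewhere else.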

All in all, offline deduplication seems much more appealing and, up to
now, it seems to have no drawbacks whatsoever (except a negligible lag
between the extraction time and the deduplication run).

Cheers.
-- 
Stefano Zacchiroli  . . . . . . .  z...@upsilon.cc . . . . o . . . o . o
Maître de conférences . . . . . http://upsilon.cc/zack . . . o . . . o o
Former Debian Project Leader  . . @zack on identi.ca . . o o o . . . o .
« the first rule of tautology club is the first rule of tautology club »




Bug#761117: debsources: file-level deduplication

2014-09-10 Thread Paul Wise
On Thu, Sep 11, 2014 at 4:31 AM, Stefano Zacchiroli wrote:

> We already have all the file checksums in the database. Removing
> (file-level) duplication in the file storage, using hard-links, can be
> safely implemented offline, i.e., as long as no debsources update is
> ongoing.

I missed the talk, but some other ideas:

A hash based filesystem layout like we use on snapshot.d.o.

Use a filesystem with deduplication support like btrfs.

-- 
bye,
pabs

https://wiki.debian.org/PaulWise





Bug#761117: debsources: file-level deduplication

2014-09-10 Thread Stefano Zacchiroli
Package: qa.debian.org
Severity: wishlist

We already have all the file checksums in the database. Removing
(file-level) duplication in the file storage, using hard-links, can be
safely implemented offline, i.e., as long as no debsources update is
ongoing.
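
As a rough illustration of the idea, here is a minimal sketch of such an
offline pass. It assumes the caller has already built a mapping from
checksum to the list of paths sharing it (for instance from the
checksums table); the function name deduplicate and that mapping are
hypothetical, not existing debsources code.

```python
import os
import stat


def deduplicate(paths_by_checksum):
    """Hard-link files with identical content; return how many were replaced.

    paths_by_checksum maps a checksum to the list of paths sharing it.
    The checksums are trusted: file contents are not compared again here.
    """
    replaced = 0
    for paths in paths_by_checksum.values():
        if len(paths) < 2:
            continue
        keeper, *duplicates = paths
        keeper_stat = os.lstat(keeper)
        if not stat.S_ISREG(keeper_stat.st_mode):
            continue  # only regular files are handled
        for dup in duplicates:
            dup_stat = os.lstat(dup)
            if not stat.S_ISREG(dup_stat.st_mode):
                continue
            if dup_stat.st_dev != keeper_stat.st_dev:
                continue  # hard links cannot cross filesystems
            if dup_stat.st_ino == keeper_stat.st_ino:
                continue  # already linked by a previous run
            # Link under a temporary name, then rename over the duplicate,
            # so the path never goes missing in between.
            tmp = f"{dup}.dedup-tmp"
            os.link(keeper, tmp)
            os.replace(tmp, dup)
            replaced += 1
    return replaced
```

Note that hard-linked files share a single inode, hence also
permissions and timestamps, so this only fits if per-file metadata does
not need to be preserved.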

Micro-benchmark (from my DebConf14 Debsources talk) of the expected disk
space saving:

select count(*) from checksums;                -> 35'370'653
select count(distinct sha256) from checksums;  -> 15'822'745
                                                  ----------
  => deduplicated core: ~45% (15'822'745 / 35'370'653)
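
For the record, the arithmetic behind that figure can be checked as
follows. The variable names are mine; the two counts are the query
results above. Keep in mind that the ratio counts files, not bytes, so
the actual disk space saving depends on the sizes of the duplicated
files.

```python
total_files = 35_370_653     # select count(*) from checksums
distinct_files = 15_822_745  # select count(distinct sha256) from checksums

# Share of files that would remain after file-level deduplication.
unique_fraction = distinct_files / total_files
# Share of files that duplicate some other file.
duplicate_fraction = 1 - unique_fraction

print(f"unique:     {unique_fraction:.1%}")
print(f"duplicates: {duplicate_fraction:.1%}")
```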

Cheers.

