On Sun, Mar 25, 2012 at 7:17 PM, Ashod Nakashian <ashodnakash...@yahoo.com> wrote:
[snip]
>> From: Hyrum K Wright <hyrum.wri...@wandisco.com>
[snip]
>> In some respects, it looks like you're solving *two* problems:
>> compression and the internal fragmentation due to large FS block
>> sizes. How orthogonal are the problems? Could they be solved
>> independently of each other in some way? I know that compression
>> exposes the internal fragmentation issue, but used alone it certainly
>> doesn't make things *worse* does it?
>
> Compression exposes internal fragmentation and, yes, it makes it *worse*
> (see hard numbers below).
> Therefore, compression and internal fragmentation are orthogonal only if we
> care about absolute savings. (In other words, compressed files cause more
> internal fragmentation, but overall footprint is still reduced, however not
> as efficiently as ultimately possible.)
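Just to put some made-up numbers on the block arithmetic here (illustrative
only, not the hard figures from the thread, and the 4096-byte block size is
merely an assumption):

[[[
# Toy sketch: compression vs. internal fragmentation on a filesystem
# with an assumed 4096-byte block size.
BLOCK_SIZE = 4096

def on_disk_size(file_size, block_size=BLOCK_SIZE):
    """Bytes actually allocated: file size rounded up to whole blocks."""
    blocks = (file_size + block_size - 1) // block_size
    return blocks * block_size

def waste(file_size, block_size=BLOCK_SIZE):
    """Internal fragmentation: allocated bytes minus useful bytes."""
    return on_disk_size(file_size, block_size) - file_size

# A 5000-byte pristine that compresses down to 2000 bytes (made-up sizes):
original, compressed = 5000, 2000
print(on_disk_size(original), waste(original))      # 8192 allocated, 3192 wasted
print(on_disk_size(compressed), waste(compressed))  # 4096 allocated, 2096 wasted
# The total footprint shrinks (8192 -> 4096 bytes), but the wasted share
# of the allocation grows (3192/8192 ~ 39% -> 2096/4096 ~ 51%).
]]]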
By "doesn't make things worse", maybe Hyrum meant that compression doesn't
magically cause more blocks to be used because of fragmentation. I mean,
sure, there is more fragmentation relative to the amount of data, but that's
just because the amount of data decreased, right? Anyway, it depends on how
you look at it; not too important.

[snip]

> Since the debated techniques (individual gz files vs packing+gz) for
> implementing Compressed Pristines are within reach with existing tools, and
> indeed some tests were done to yield hard figures (see older messages in this
> thread), it's reasonable to run simulations that can show with hard numbers
> the extent to which the speculations and estimations done (mostly by yours
> truly) regarding the advantages of a custom pack file are justifiable.

I haven't read the design doc yet (it's a bit too big for me at the moment;
I'm just following the dev threads), so sorry if I'm talking nonsense.

Wouldn't gz+packing (compress each pristine individually, then pack the
compressed files) be another interesting compromise? It wouldn't exploit
inter-file similarities, but it would still give us compression and reduced
fragmentation. Can you test that as well? Maybe that gets us "80% of the
gain for 20% of the effort" ...

I'm certainly not an expert, but intuitively the packing+gz approach seems
difficult to me, if only because you need to uncompress a full pack file to
be able to read a single pristine (since offsets are relative to the
uncompressed stream). So the advantage of exploiting inter-file similarities
had better be worth it.

With gz+packing, there is no need to uncompress an entire pack file just to
read a single pristine. You can simply keep offsets (in wc.db or wherever)
to where each individually compressed pristine starts inside the pack file
(rough sketch at the end of this mail).

Why not simply compress the "shards" we already have in the pristine store
(sharded by the first two characters of the pristine checksum)? Or do we run
the risk that such compressed shards become too large (e.g. larger than
2 GB), which we'd want to avoid?

-- 
Johan
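P.S.: here is the rough sketch of the gz+packing bookkeeping I have in mind
(hypothetical helper names, and a plain Python dict standing in for the
index that would really live in wc.db):

[[[
import gzip, os

def pack_pristine(pack_path, index, checksum, data):
    """Gzip one pristine on its own, append it to the pack file, and
    remember where it landed."""
    compressed = gzip.compress(data)
    with open(pack_path, "ab") as pack:
        pack.seek(0, os.SEEK_END)   # appends go to the end; record that offset
        offset = pack.tell()
        pack.write(compressed)
    index[checksum] = (offset, len(compressed))

def read_pristine(pack_path, index, checksum):
    """Read back a single pristine without touching the rest of the pack."""
    offset, length = index[checksum]
    with open(pack_path, "rb") as pack:
        pack.seek(offset)
        return gzip.decompress(pack.read(length))

# Usage (fake checksum, just for illustration):
index = {}
pack_pristine("pristines.pack", index, "ab34ef...", b"pristine contents")
assert read_pristine("pristines.pack", index, "ab34ef...") == b"pristine contents"
]]]

Reading one pristine then costs a single seek plus decompressing only that
member, instead of decompressing the whole pack.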