----- Original Message -----
> From: Johan Corveleyn <jcor...@gmail.com>
> To: Ashod Nakashian <ashodnakash...@yahoo.com>
> Cc: "dev@subversion.apache.org" <dev@subversion.apache.org>
> Sent: Monday, March 26, 2012 3:10 AM
> Subject: Re: Compressed Pristines (Simulation)
>
> On Sun, Mar 25, 2012 at 7:17 PM, Ashod Nakashian
> <ashodnakash...@yahoo.com> wrote:
> [snip]
>>> From: Hyrum K Wright <hyrum.wri...@wandisco.com>
> [snip]
>>> In some respects, it looks like you're solving *two* problems:
>>> compression and the internal fragmentation due to large FS block
>>> sizes. How orthogonal are the problems? Could they be solved
>>> independently of each other in some way? I know that compression
>>> exposes the internal fragmentation issue, but used alone it certainly
>>> doesn't make things *worse* does it?
>>
>> Compression exposes internal fragmentation and, yes, it makes it *worse*
>> (see hard numbers below).
>> Therefore, compression and internal fragmentation are orthogonal only if we
>> care about absolute savings. (In other words, compressed files cause more
>> internal fragmentation, but overall footprint is still reduced, however not
>> as efficiently as ultimately possible.)
>
> By "doesn't make things worse", maybe Hyrum meant that compression
> doesn't magically cause more blocks to be used because of
> fragmentation. I mean, sure there is more fragmentation relative to
> the amount of data, but that's just because the amount of data
> decreased, right? Anyway, it depends on how you look at it, not too
> important.

Yes, I should've made myself clearer. I meant that's the case, but the
opportunity for further reduction in disk space is also increased (which is
a prime interest of this feature).
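To make the block-rounding effect concrete, here is a tiny stand-alone
Python sketch (illustration only, not SVN code; the 4 KB block size and all
file sizes are made-up values) showing how the fixed per-file slack becomes
a larger share of the allocation once a file is compressed:

    # Toy model of internal fragmentation: every file occupies whole
    # filesystem blocks, so shrinking a file does not shrink its slack.
    BLOCK = 4096  # assumed filesystem block size

    def allocated(size, block=BLOCK):
        """Bytes the filesystem actually reserves for a 'size'-byte file."""
        return ((size + block - 1) // block) * block

    # (uncompressed, compressed) sizes of three hypothetical pristines
    samples = [(10_500, 4_200), (3_000, 1_100), (700, 350)]

    for raw, gz in samples:
        for label, size in (("raw", raw), ("gz ", gz)):
            alloc = allocated(size)
            slack = alloc - size
            print(f"{label} {size:>6} B -> {alloc:>6} B on disk "
                  f"({slack} B slack, {100.0 * slack / alloc:.0f}% wasted)")

Packing the compressed pristines back to back is what removes that per-file
slack, which is why the two problems end up being attacked together.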
>
> [snip]
>
>> Since the debated techniques (individual gz files vs packing+gz) for
>> implementing Compressed Pristines are within reach with existing tools,
>> and indeed some tests were done to yield hard figures (see older messages
>> in this thread), it's reasonable to run simulations that can show with
>> hard numbers the extent to which the speculations and estimations done
>> (mostly by yours truly) regarding the advantages of a custom pack file
>> are justifiable.
>
> I didn't read the design doc yet (it's a bit too big for me at the
> moment, I'm just following the dev-threads), so sorry if I'm saying
> nonsense.

You should read it :-) As you'll see, some of your points are exactly what's
being proposed.

> Wouldn't gz+packing be another interesting compromise? It wouldn't
> exploit inter-file similarities, but it would yield compression and
> reduced fragmentation. Can you test that as well? Maybe that gets us
> "80% gain for 20% of the effort" ...

Yes, and that's a "stage" in the implementation of this feature. We don't
have to go for the optimal implementation right away; packing alone will
already help. What's being argued is whether or not a custom file format
(the pack file) is necessary.

> I'm certainly not an expert, but intuitively the packing+gz approach
> seems difficult to me, if only because you need to uncompress a full
> pack file to be able to read a single pristine (since offsets are
> relative to the uncompressed stream). So the advantage of exploiting
> inter-file similarities better be worth it.

To avoid that, we'll compress individual blocks (each containing at least
one file, and potentially many, to exploit inter-file similarities).

> When going for gz+packing, there is no need to uncompress an entire
> pack file just to start reading a single pristine. You can just keep
> offsets (in wc.db or whatever) to where the individual compressed
> pristines exist inside the pack file.

Indeed, that's what's proposed, although there are two suggestions for where
to keep those offsets: a custom index file that dumps structures which can
be reloaded quickly, or wc.db.

> Why not simply compress the "shards" we already have in the pristine
> store (sharded by the first two characters of the pristine checksum)?
> Or do we run the risk that such compressed shards are going to become
> too large (e.g. larger than 2 GB), and we want to avoid such a thing?

They may indeed become too large (as you suspected); what unites the files
within a shard is their hash, not their contents; it requires the same
infrastructure to locate a file and the same overhead of managing
inserts/deletes... so it has no advantage and all of the same problems. The
proposal is a custom file format that is simple, supports all our
requirements out of the box, and is file-name and file-type aware, so it can
exploit all of that if necessary.

-Ash

>
> --
> Johan
>
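P.S. Since the thread keeps coming back to the offset-index idea, here is a
minimal, self-contained Python sketch of it (illustration only, not the
proposed on-disk format): each entry is compressed on its own and appended
to a pack file, and an index maps its checksum to (offset, length), so
reading one pristine never touches the rest of the pack. The file name, the
dict-based index and per-entry zlib compression are stand-ins; the actual
proposal would keep the index in wc.db or a custom index file, and could
group several small files into one compressed block to exploit inter-file
similarities.

    # Sketch of "pack + per-entry compression + offset index" (not SVN code).
    import zlib

    def pack(pristines, pack_path):
        """pristines: dict of checksum -> content bytes.  Returns the index."""
        index = {}
        with open(pack_path, "wb") as pack_file:
            offset = 0
            for checksum, data in pristines.items():
                blob = zlib.compress(data)  # each entry compressed on its own
                pack_file.write(blob)
                index[checksum] = (offset, len(blob))
                offset += len(blob)
        return index

    def read_pristine(pack_path, index, checksum):
        """Read and decompress one pristine without touching the whole pack."""
        offset, length = index[checksum]
        with open(pack_path, "rb") as pack_file:
            pack_file.seek(offset)
            return zlib.decompress(pack_file.read(length))

    # usage: two made-up pristines keyed by fake checksums
    idx = pack({"ab12": b"hello " * 100, "cd34": b"world " * 100},
               "pristines.pack")
    assert read_pristine("pristines.pack", idx, "cd34") == b"world " * 100

Compressing each entry on its own is what keeps single-pristine reads cheap;
grouping files into shared blocks trades some of that away for better
inter-file compression.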