----- Original Message -----
> From: Johan Corveleyn <jcor...@gmail.com>
> To: Ashod Nakashian <ashodnakash...@yahoo.com>
> Cc: "dev@subversion.apache.org" <dev@subversion.apache.org>
> Sent: Monday, March 26, 2012 3:10 AM
> Subject: Re: Compressed Pristines (Simulation)
>
> On Sun, Mar 25, 2012 at 7:17 PM, Ashod Nakashian
> <ashodnakash...@yahoo.com> wrote:
> [snip]
>>> From: Hyrum K Wright <hyrum.wri...@wandisco.com>
> [snip]
>>> In some respects, it looks like you're solving *two* problems:
>>> compression and the internal fragmentation due to large FS block
>>> sizes. How orthogonal are the problems? Could they be solved
>>> independently of each other in some way? I know that compression
>>> exposes the internal fragmentation issue, but used alone it certainly
>>> doesn't make things *worse* does it?
>>
>> Compression exposes internal fragmentation and, yes, it makes it *worse*
>> (see hard numbers below).
>> Therefore, compression and internal fragmentation are orthogonal only if we
>> care about absolute savings. (In other words, compressed files cause more
>> internal fragmentation, but overall footprint is still reduced, however not
>> as efficiently as ultimately possible.)
>
> By "doesn't make things worse", maybe Hyrum meant that compression
> doesn't magically cause more blocks to be used because of
> fragmentation. I mean, sure there is more fragmentation relative to
> the amount of data, but that's just because the amount of data
> decreased, right? Anyway, it depends on how you look at it, not too
> important.

Yes, I should've made myself clearer. I meant that's the case, but the
opportunity for further reduction in disk space is also increased (which is
a prime interest of this feature).
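To make the block-rounding effect concrete, here is a tiny stand-alone
Python sketch (illustration only, not SVN code; the 4 KB block size and all
file sizes are made-up values) showing how the fixed per-file slack becomes
a larger share of the allocation once a file is compressed:

    # Toy model of internal fragmentation: every file occupies whole
    # filesystem blocks, so shrinking a file does not shrink its slack.
    BLOCK = 4096  # assumed filesystem block size

    def allocated(size, block=BLOCK):
        """Bytes the filesystem actually reserves for a 'size'-byte file."""
        return ((size + block - 1) // block) * block

    # (uncompressed, compressed) sizes of three hypothetical pristines
    samples = [(10_500, 4_200), (3_000, 1_100), (700, 350)]

    for raw, gz in samples:
        for label, size in (("raw", raw), ("gz ", gz)):
            alloc = allocated(size)
            slack = alloc - size
            print(f"{label} {size:>6} B -> {alloc:>6} B on disk "
                  f"({slack} B slack, {100.0 * slack / alloc:.0f}% wasted)")

Packing the compressed pristines back to back is what removes that per-file
slack, which is why the two problems end up being attacked together.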
>
> [snip]
>
>> Since the debated techniques (individual gz files vs packing+gz) for
>> implementing Compressed Pristines are within reach with existing tools,
>> and indeed some tests were done to yield hard figures (see older messages
>> in this thread), it's reasonable to run simulations that can show with
>> hard numbers the extent to which the speculations and estimations done
>> (mostly by yours truly) regarding the advantages of a custom pack file
>> are justifiable.
>
> I didn't read the design doc yet (it's a bit too big for me at the
> moment, I'm just following the dev-threads), so sorry if I'm saying
> nonsense.

You should read it :-) As you'll see, some of your points are exactly what's
being proposed.

> Wouldn't gz+packing be another interesting compromise? It wouldn't
> exploit inter-file similarities, but it would yield compression and
> reduced fragmentation. Can you test that as well? Maybe that gets us
> "80% gain for 20% of the effort" ...

Yes, and that's a "stage" in the implementation of this feature. We don't
have to go for the optimal implementation right away; packing alone will
already help. What's being argued is whether or not a custom file format
(the pack file) is necessary.

> I'm certainly not an expert, but intuitively the packing+gz approach
> seems difficult to me, if only because you need to uncompress a full
> pack file to be able to read a single pristine (since offsets are
> relative to the uncompressed stream). So the advantage of exploiting
> inter-file similarities better be worth it.

To avoid that, we'll compress individual blocks (each containing at least
one file, and potentially many, to exploit inter-file similarities).

> When going for gz+packing, there is no need to uncompress an entire
> pack file just to start reading a single pristine. You can just keep
> offsets (in wc.db or whatever) to where the individual compressed
> pristines exist inside the pack file.

Indeed, that's what's proposed, although there are two suggestions for where
to keep those offsets: a custom index file that dumps structures which can
be reloaded quickly, or wc.db.

> Why not simply compress the "shards" we already have in the pristine
> store (sharded by the first two characters of the pristine checksum)?
> Or do we run the risk that such compressed shards are going to become
> too large (e.g. larger than 2 GB), and we want to avoid such a thing?

They may indeed become too large (as you suspected); what unites the files
within a shard is their hash, not their contents; it requires the same
infrastructure to locate a file and the same overhead of managing
inserts/deletes... so it has no advantage and all of the same problems. The
proposal is a custom file format that is simple, supports all our
requirements out of the box, and is file-name and file-type aware, so it can
exploit all of that if necessary.

-Ash

>
> --
> Johan
>
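P.S. Since the thread keeps coming back to the offset-index idea, here is a
minimal, self-contained Python sketch of it (illustration only, not the
proposed on-disk format): each entry is compressed on its own and appended
to a pack file, and an index maps its checksum to (offset, length), so
reading one pristine never touches the rest of the pack. The file name, the
dict-based index and per-entry zlib compression are stand-ins; the actual
proposal would keep the index in wc.db or a custom index file, and could
group several small files into one compressed block to exploit inter-file
similarities.

    # Sketch of "pack + per-entry compression + offset index" (not SVN code).
    import zlib

    def pack(pristines, pack_path):
        """pristines: dict of checksum -> content bytes.  Returns the index."""
        index = {}
        with open(pack_path, "wb") as pack_file:
            offset = 0
            for checksum, data in pristines.items():
                blob = zlib.compress(data)  # each entry compressed on its own
                pack_file.write(blob)
                index[checksum] = (offset, len(blob))
                offset += len(blob)
        return index

    def read_pristine(pack_path, index, checksum):
        """Read and decompress one pristine without touching the whole pack."""
        offset, length = index[checksum]
        with open(pack_path, "rb") as pack_file:
            pack_file.seek(offset)
            return zlib.decompress(pack_file.read(length))

    # usage: two made-up pristines keyed by fake checksums
    idx = pack({"ab12": b"hello " * 100, "cd34": b"world " * 100},
               "pristines.pack")
    assert read_pristine("pristines.pack", idx, "cd34") == b"world " * 100

Compressing each entry on its own is what keeps single-pristine reads cheap;
grouping files into shared blocks trades some of that away for better
inter-file compression.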