On Wed, Apr 4, 2012 at 1:28 PM, Ashod Nakashian <ashodnakash...@yahoo.com> wrote:
> I feel this is indeed what we're closing on, at least for an initial working
> demo. But I'd like to hear more agreements before committing to this path. I
> know some did show support for this approach, but it's hard to track them in
> the noise.
>
> So to make it easier, let's either voice support to this suggestion and
> commit to an implementation, or voice objection with at least reasons and
> possibly alternative action. Silence is passive agreement, so the onus on
> those opposing ;-)
I just read the Google doc - glad to see progress here - a few comments:

First off, if I understand correctly, I do have to say that I'm not at all a fan of having a large pristine file spread out across multiple on-disk compressed pack files. I don't think that makes a whole lot of sense - I think it'd be simplest (when we hit the heuristic to put it on-disk rather than in SQLite) to keep it to just one file. I don't get why we'd want a big image pristine file (say a PSD file) split out into, say, 20 smaller files on disk. Why? It just seems we're going to introduce a lot of complexity for very little return.

The whole point of stashing the small files directly into SQLite's pristine.db is to make the small files SQLite's problem and not the on-disk FS's (and reduce sub-block issues) - with that in place, I don't think we need to throw multiple files into the same pack file. It'll just get too confusing, IMO, to keep track of which file offsets to use. (For a large file that already hits the size trigger, we know that - worst-case scenario - we might lose one FS block. Yawn.) We can make the whole strategy simpler if we follow that; a rough sketch of the split I have in mind is below.

I'm with Greg in thinking that we don't need the pack index files - but I'll go further and reiterate that I think there should just be a 1:1 correspondence between the pack file and the pristine file. What's the real advantage of having multiple large pristines in one pack file that we constantly *append* to? With append ops on a pack file shared by multiple pristines, we rely on our WC/FS locking strategy being 100% perfect or we have a hosed pack file. Ugh. I think it just adds unnecessary complexity. It'll be far simpler to have a 1:1 correspondence between a pack file and a single large pristine. We'll have enough complexity already in just finding the small files sitting in SQLite rather than on-disk.

Given that 1:1 correspondence, I do think that the original file size and the complete checksum should be stored in the custom pack file on-disk. That would let us easily validate whether the pack file is corrupt, using the file size as a first-order check and the checksum as a second-order check. The thinking here is that if the checksum is only in the file name (or in pristine.db) and not in the file contents, the file system could very easily lose the filename (hello ext3/4!) - keeping it in the contents lets us verify the integrity of the pack file and reassociate it if it gets dis-associated. This is less of an issue on the client, which can always refetch - but if the server code ends up using the same on-disk format (as hinted in the Google Doc), then I think this is important to have in the file format from the beginning. I definitely think that we should store the full 64-bit svn_filesize_t length and not be cute and assume no one has a file larger than 1TB. (A sketch of what such a header could look like is below.)

I'll disagree with Greg for now and say that it's probably best to just pack everything and not try to be cute about skipping certain file types - I think that's a premature optimization right now. The complexity of a mixed pristine collection, with some files packed and some unpacked (and some files in SQLite and some on-disk), is odd. Maybe we end up adding a no-op compression type to the file format (IOW, tell zlib to do a no-op deflate via Z_NO_COMPRESSION). Maybe. I just doubt it's worth the additional complexity though. ("Is this pristine file compressed?" "I don't know." Argh!)
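To make the small-vs-large split concrete, here's the rough shape of the decision I mean - the threshold value and names are invented purely for illustration, not anything in the current wc code:

    /* Hypothetical sketch: the threshold and names are placeholders. */
    #include <stdint.h>

    #define PRISTINE_INLINE_THRESHOLD (64 * 1024)  /* e.g. 64KB; tune later */

    /* Decide where a pristine of SIZE bytes should live: small ones
       become BLOBs in pristine.db, everything else gets its own
       on-disk pack file (1:1, no sharing of pack files). */
    static int
    pristine_goes_in_sqlite(uint64_t size)
    {
      return size < PRISTINE_INLINE_THRESHOLD;
    }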
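And to be explicit about what I mean by putting the size and checksum into the pack file itself, something along these lines - the magic value, field names and layout are made up just to show the idea (real code would also worry about endianness and padding):

    #include <stdint.h>

    #define PACK_MAGIC 0x53564e50u          /* arbitrary marker for the sketch */

    /* Illustrative fixed header at the front of each 1:1 pack file. */
    struct pack_header
    {
      uint32_t magic;                       /* identifies a pack file */
      uint32_t compression;                 /* 0 = stored, 1 = zlib, ... */
      uint64_t pristine_size;               /* full 64-bit size of the pristine */
      unsigned char checksum[20];           /* checksum of the uncompressed pristine */
    };

    /* First-order check: does the advertised size match what we actually
       recovered?  (The second-order check, not shown, would re-compute
       the checksum over the contents and compare against the header.) */
    static int
    pack_size_looks_sane(const struct pack_header *hdr,
                         uint64_t recovered_size)
    {
      return hdr->magic == PACK_MAGIC
             && hdr->pristine_size == recovered_size;
    }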
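For reference, the "no-op compression" idea is just zlib at level 0: the data still goes through the same container format, it just isn't actually compressed, so readers keep a single code path. A minimal sketch (the function name is made up):

    #include <zlib.h>

    /* Compress SRC into DEST.  With store_only set, compress2() runs at
       Z_NO_COMPRESSION: it still emits a valid zlib stream, but does no
       real compression.  Returns Z_OK on success. */
    static int
    pack_deflate(unsigned char *dest, unsigned long *dest_len,
                 const unsigned char *src, unsigned long src_len,
                 int store_only)
    {
      int level = store_only ? Z_NO_COMPRESSION : Z_DEFAULT_COMPRESSION;
      return compress2(dest, dest_len, src, src_len, level);
    }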
As for guessing which file types are worth compressing: making those assumptions based on file extensions or even magic bits can be a bit awkward - case in point is PDF...sometimes it'll compress well, sometimes it won't. So, best off just always compressing it. =)

My $.02.

-- justin