On 17/12/10 16:44, Trevor Vaughan wrote:
> I've been looking at the usage of MD5 checksums by Puppet and I think
> that there may be room for quite a bit of optimization.

I do agree.

> The clients seem to compute the MD5 checksum of all files and of all
> in-catalog content every time they compare two files. What if:
> 
> 1) The size of any known content is used as a first level comparison.
> Obviously, if the sizes differ, the files differ. I don't see this in
> 0.24.X, but I haven't checked 2.6.X.

That's more or less what rsync does. For sourced files we could even use
HTTP If-Modified-Since and/or If-None-Match to perform the check (and
thus the check would be done server side).
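To illustrate the idea, a minimal sketch (the helper name is mine, not
anything in Puppet) of using size as the cheap first-level check before
falling back to a full digest:

```ruby
require 'digest/md5'

# Hypothetical helper: size first, digest only if sizes match.
def files_identical?(path_a, path_b)
  # If the sizes differ, the files differ -- no digest needed.
  return false unless File.size(path_a) == File.size(path_b)

  # Sizes match, so fall back to the full MD5 comparison.
  Digest::MD5.file(path_a).hexdigest == Digest::MD5.file(path_b).hexdigest
end
```

In the common unchanged case you still pay for two digests, but any
size change short-circuits immediately.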

> 2) The *server* pre-computes checksums for all content items in File
> resources and passes those in the catalog, then only one MD5 sum needs
> to be calculated.

That's something I already noticed when I worked on the file streaming.
We're constantly checksumming files. For instance we perform a full
checksum when writing a file, then once it is written we checksum it
again to make sure we wrote the file fully. At the time I left the code
as it was, but I think this second pass might not be necessary.
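The second pass could be avoided by checksumming the bytes as they are
written, something like this sketch (the function is illustrative, not
existing Puppet code):

```ruby
require 'digest/md5'

# Sketch: compute the digest in the same pass as the write, instead of
# re-reading the file afterwards for a verification checksum.
def write_with_checksum(path, content)
  digest = Digest::MD5.new
  File.open(path, 'wb') do |f|
    digest << content   # checksum the bytes as we go...
    f.write(content)    # ...and write them in the same pass
  end
  digest.hexdigest      # no second read of the file needed
end
```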

> 3) When using the puppet server in a 'source' element, the server
> passes the checksum of the file on the server. If they differ, then
> the file is passed across to the client.

As I said earlier we really could leverage
If-Modified-Since/If-None-Match HTTP/1.1 system for this.
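A sketch of what a conditional GET looks like with Net::HTTP (the
helper name and the idea of carrying the checksum in If-None-Match are
mine, not Puppet's actual wire protocol): a 304 reply means the cached
copy is current and no body crosses the wire.

```ruby
require 'net/http'

# Hypothetical client-side check: send the checksum we already hold as
# an ETag; the server answers 304 Not Modified if it matches.
def fetch_if_changed(uri, cached_etag)
  req = Net::HTTP::Get.new(uri)
  req['If-None-Match'] = cached_etag if cached_etag
  res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
  # nil means "unchanged, keep your copy"; otherwise the new content.
  res.is_a?(Net::HTTPNotModified) ? nil : res.body
end
```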

> 4) For ultimate speed, a direct comparison should be an option as a
> checksum type. Directly comparing the content of the in-memory file
> and the target file appears to be twice as fast as an MD5 checksum.
> This would not be feasible for a 'source'.

That might be faster, but please don't re-introduce the
slurp-the-whole-file-into-memory syndrome.

> These techniques will place more burden on the server, but may cut the
> CPU resources needed on the client by as much as half from some
> preliminary testing.
> 
>         user     system      total        real
> MD5:   0.810000   0.230000   1.040000 (  1.050886)
> MD52:  0.400000   0.120000   0.520000 (  0.525936)
> Hash:  0.550000   0.270000   0.820000 (  0.821033)
> Comp:  0.290000   0.120000   0.410000 (  0.407351)
> 
> MD5 -> MD5 comparison of two 100M files
> MD52 -> MD5 comparison where one file has been pre-computed
> Hash -> Using String.hash to do the comparison
> Comp -> Direct comparison of the files

For Comp: did you read the file fully into RAM or did you do it block
by block?
If you read it fully, can you try the experiment again reading block by
block (let's say 8k) with identical files (so that's your worst case)
and compare to the in-memory solution?
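For reference, the block-by-block variant I have in mind would look
something like this sketch (helper name and 8k block size are just for
illustration); it never holds more than one block per file in memory
and bails out at the first differing block:

```ruby
BLOCK_SIZE = 8192  # 8k blocks, per the suggestion above

# Compare two files block by block instead of slurping them into RAM.
def same_content?(path_a, path_b)
  return false unless File.size(path_a) == File.size(path_b)

  File.open(path_a, 'rb') do |a|
    File.open(path_b, 'rb') do |b|
      while (block_a = a.read(BLOCK_SIZE))
        # Equal sizes mean the reads stay aligned block for block.
        return false unless block_a == b.read(BLOCK_SIZE)
      end
    end
  end
  true
end
```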

For file-change comparison we might introduce some new checksum types
that are much less CPU-hungry than full message digests. I'm really not
an expert in this, so maybe I'm completely wrong, but combining
size change, mtime change and a Fletcher/Adler or other CRC checksum
might give us what we want.
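As a rough sketch of that combination, assuming Zlib's Adler-32 as the
cheap checksum (the function and the pairing with size are illustrative
only, not an existing Puppet checksum type):

```ruby
require 'zlib'

# Build a lightweight change signature from size plus a running
# Adler-32 over the content; mtime could be folded in the same way.
def cheap_signature(path)
  sum = Zlib.adler32  # initial Adler-32 value
  File.open(path, 'rb') do |f|
    while (block = f.read(8192))
      sum = Zlib.adler32(block, sum)  # far cheaper per byte than MD5
    end
  end
  [File.size(path), sum]
end
```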

> If anyone can provide a quick and dirty hack to get these into Puppet,
> I'll be happy to test them.

That's really something I'd like to work on. Unfortunately this is
really complex stuff. The file type is one of the biggest types, and
even though I've already worked on it, I'm not sure I've grasped enough
of it to fully refactor it around a different inner working.
-- 
Brice Figureau
My Blog: http://www.masterzen.fr/
