Issue #5650 has been updated by Brice Figureau.
Daniel Pittman wrote:
> Brice Figureau wrote:
> > I was thinking about this lately. My conclusion is that the file type
> > makes it quite hard to implement something optimized.
>
> Ultimately, I think it is the protocol between the type, provider, and the
> Puppet transaction that makes it difficult; because the type or provider
> have no access to make smarter, multi-value decisions about fixing the
> subject of their attention, we wind up in this position.

Yes, I concur, but breaking this would be rather complex, IMHO. My proposed
solution, although simple, still fits within the framework and involves only
a modest set of modifications to the type system.

> > But there might be a way, by defining a lazy checksum which at first only
> > contains file size and mtime. Then, instead of using a string
> > representation of said checksum in the agent as it is now, we use real
> > instances of those lazy checksums.
>
> We have historically avoided putting anything above basic data types in the
> catalog, and intend to continue that. So, whatever data structure is used
> would need to be represented (and, ideally, representable) in a textual
> format for both the RAL, and for the JSON transport format of the catalog.

My lazy checksum could be represented as a string (i.e. mtime concatenated
with size, or equivalent); the behavior would live in the lazy checksum
mechanism. Yes, this is a change from what we have now: checksums don't
exist as entities in Puppet, only as string content.

> Theoretically, though, your approach would work, but it involves working
> around the existing semantics of property sync. That kind of worries me,
> because we have enough trouble working around them at the moment.

We're not really touching the sync semantics. Nothing says that we must have
all the data at hand when doing the property insync comparison. I'm just
adding some proxy objects that might not have all the needed data available
until the final comparison is performed.
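To make the proxy-object idea concrete, here is a minimal sketch of what such a lazy checksum might look like in Ruby. The class name and methods are hypothetical illustrations, not Puppet's actual API: the object serializes as a cheap mtime:size string for the catalog, and only computes a real digest when the cheap comparison cannot settle the question.

```ruby
require 'digest/md5'

# Hypothetical sketch (not Puppet code): a "lazy" checksum that has a
# catalog-friendly string form, and defers the expensive digest until
# a comparison actually needs it.
class LazyChecksum
  attr_reader :path

  def initialize(path)
    @path = path
  end

  # Cheap string representation: mtime concatenated with size.
  def to_s
    stat = File.stat(@path)
    "#{stat.mtime.to_i}:#{stat.size}"
  end

  # Full digest, computed at most once, and only on demand.
  def digest
    @digest ||= Digest::MD5.file(@path).hexdigest
  end

  # Compare cheaply first: differing sizes prove the contents differ,
  # so the costly digest is skipped entirely in that case.
  def ==(other)
    return false if File.stat(@path).size != File.stat(other.path).size
    digest == other.digest
  end
end
```

The key point is that `insync?` could still compare two property values with `==`; the laziness lives entirely inside the value objects.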
In the end we're still comparing property equality, but on checksum objects
instead of on checksum string representations.

> I guess that something akin to a "set of checksums" that must all be
> satisfied would work, though: first, satisfy that the size matches; second,
> satisfy sha1(first 8192 bytes); third, satisfy sha1(whole file).

This would fundamentally change how properties work, wouldn't it?

----------------------------------------
Feature #5650: Potentially able to save a great deal of time on File comparisons.
https://projects.puppetlabs.com/issues/5650

Author: Trevor Vaughan
Status: Accepted
Priority: Normal
Assignee:
Category: file
Target version:
Affected Puppet version:
Keywords:
Branch:

I've been looking at the usage of MD5 checksums by Puppet and I think that
there may be room for quite a bit of optimization. The clients seem to
compute the MD5 checksum of all files and in-catalog content every time they
compare two files. What if:

1) The size of any known content is used as a first-level comparison.
   Obviously, if the sizes differ, the files differ. I don't see this in
   0.24.X, but I haven't checked 2.6.X.

2) The *server* pre-computes checksums for all content items in File
   resources and passes those in the catalog; then only one MD5 sum needs
   to be calculated.

3) When using the puppet server in a 'source' element, the server passes
   the checksum of the file on the server. If they differ, then the file is
   passed across to the client.

4) For ultimate speed, a direct comparison should be an option as a
   checksum type. Directly comparing the content of the in-memory file and
   the target file appears to be twice as fast as an MD5 checksum. This
   would not be feasible for a 'source'.

These techniques will place more burden on the server, but may cut the CPU
resources needed on the client by as much as half, based on some
preliminary testing.
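Daniel's tiered "set of checksums" could be sketched as follows. This is illustrative Ruby only, not Puppet's actual property machinery, and the names are hypothetical: each tier is cheap relative to the next, and a mismatch at any tier proves the files differ without ever running the costlier checks.

```ruby
require 'digest/sha1'

# Illustrative sketch: checks ordered from cheapest to most expensive.
TIERS = [
  ->(path) { File.size(path) },                                      # 1: size
  ->(path) { Digest::SHA1.hexdigest(File.read(path, 8192) || '') },  # 2: first 8192 bytes
  ->(path) { Digest::SHA1.file(path).hexdigest }                     # 3: whole file
]

# All tiers must agree for the files to be considered in sync.
# Enumerable#all? short-circuits, so a size mismatch never
# triggers a SHA1 of the whole file.
def files_in_sync?(a, b)
  TIERS.all? { |tier| tier.call(a) == tier.call(b) }
end
```

For files that differ early (different sizes, or divergent headers), almost all of the I/O and hashing cost is avoided; only genuinely identical files pay for the full-file digest.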
             user     system      total        real
MD5:     0.810000   0.230000   1.040000 (  1.050886)
MD52:    0.400000   0.120000   0.520000 (  0.525936)
Hash:    0.550000   0.270000   0.820000 (  0.821033)
Comp:    0.290000   0.120000   0.410000 (  0.407351)

MD5  -> MD5 comparison of two 100M files
MD52 -> MD5 comparison where one file has been pre-computed
Hash -> Using String#hash to do the comparison
Comp -> Direct comparison of the files

For any technique that does not compute a checksum of the file, I would
think that a good item to record to note a change would be a combination of
the latest modified time and the size of the file. This would make for an
easy numeric comparison string, and the time can be set at the time that
you update the file.
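The proposed mtime+size change stamp could look something like the sketch below. This is an assumption-laden illustration, not Puppet's implementation; the function name is hypothetical. It yields a single comparable string that changes whenever the file is rewritten, and costs only a `stat(2)` call to compute.

```ruby
# Hypothetical sketch: a numeric comparison string combining the file's
# last-modified time and its size, as proposed above. Any rewrite that
# changes either component produces a different stamp.
def change_stamp(path)
  stat = File.stat(path)
  format('%d.%d', stat.mtime.to_i, stat.size)
end
```

One caveat worth noting: two files with identical sizes modified within the same clock second would produce equal stamps despite differing content, so this is a fast heuristic rather than a guarantee, which is why it is proposed only for techniques that skip checksumming entirely.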
