Luke,

Thanks for the clarification. I think I was looking at the 0.24
codebase so this may have thrown me off a bit.

Could you point me to the Git rev(s) where the stream summing is described?

Thanks!

Trevor

On Wed, Jan 5, 2011 at 2:00 PM, Luke Kanies <[email protected]> wrote:
> On Dec 29, 2010, at 3:41 AM, Trevor Vaughan wrote:
>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> This was just a comparison script that I wrote.
>>
>> Checking through the Ruby source, it definitely looks like the digest
>> methods just pull the entire "string" into memory.
>
> Yep, that's why we had to go through the effort of adding stream summing.
>
>> If this is a file, then we're already taking the memory hit that you
>> would take by just comparing the two files.
>
> Not with Puppet we're not - we do stream summing for both files, and then 
> compare the sums.
>
>> This makes complete sense since Digest doesn't know what you're passing.
>>
>> I will note that it looks like chunking a file and performing the
>> checksum might take twice as long.
>
> This is the big thing - I agree that this could all be much faster via the 
> mechanisms you've proposed.
>
> It's just important to know how Puppet works internally right now and why.  
> The stream summing is very important for us because if Ruby loads 100 MB files
> into memory, it basically never frees that memory again.  Thus, we give up
> speed for drastically better RAM efficiency, and we don't want to go back to
> the bad old RAM days.
>
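For anyone following along, here is a minimal sketch of what chunked
("stream") summing can look like in plain Ruby. This is not Puppet's code:
`stream_md5`, `same_sum?` and the 64 KB chunk size are names and values
assumed for illustration. The digest object accepts data incrementally via
`update`, so only one chunk needs to be in memory at a time.

```ruby
require 'digest/md5'

# Hypothetical helper, not Puppet's actual implementation: checksum a file
# by feeding it to the digest in fixed-size chunks, so that only one chunk
# is held in memory at a time.
STREAM_CHUNK_SIZE = 64 * 1024 # assumed buffer size; tune as needed

def stream_md5(path, chunk_size = STREAM_CHUNK_SIZE)
  digest = Digest::MD5.new
  File.open(path, 'rb') do |file|
    buffer = String.new
    # IO#read(length, outbuf) reuses the buffer and returns nil at EOF.
    digest.update(buffer) while file.read(chunk_size, buffer)
  end
  digest.hexdigest
end

# Compare two files by their streamed sums rather than by their content.
def same_sum?(path_a, path_b)
  stream_md5(path_a) == stream_md5(path_b)
end
```

The result should match `Digest::MD5.hexdigest` on the whole content; the
difference is only in how much of the file is resident at once.
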
>> I'm thinking that size+time (similar to rsync) might be enough for most
>> files on a system. There will be relatively few files on a system that
>> you'll want to do a full checksum on.
>
> I agree.
>
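A rough sketch of the size+time idea, assuming both files are local so
that `File.stat` works on each. `probably_unchanged?` is a hypothetical
name, and this only approximates rsync's quick check; it never reads file
content, so it can miss a change that preserves both size and mtime.

```ruby
# Hypothetical quick check in the spirit of rsync's default behaviour:
# treat two files as unchanged when size and modification time agree,
# without reading any file content.
def probably_unchanged?(path_a, path_b)
  a = File.stat(path_a)
  b = File.stat(path_b)
  a.size == b.size && a.mtime == b.mtime
end
```

A full checksum could then be reserved for the few files where a false
"unchanged" result would matter.
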
>> Thanks,
>>
>> Trevor
>>
>> On 12/26/2010 12:40 AM, Luke Kanies wrote:
>>> On Dec 23, 2010, at 4:58, Trevor Vaughan <[email protected]> wrote:
>>>
>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> Hash: SHA1
>>>>
>>>> Brice,
>>>>
>>>> Thanks for the feedback, this is good stuff!
>>>>
>>>>>
>>>>> That's more or less what rsync does. For sourced files we could even use
>>>>> HTTP If-Modified-Since and/or If-None-Match to perform the check (and
>>>>> thus the check would be done server side).
>>>>
>>>> Yes, I briefly looked at the Rsync algorithm papers to see if I could
>>>> figure out how to re-implement it in Ruby but just using the native
>>>> Rsync libraries might be a better call. However, that would introduce an
>>>> external dependency.
>>>>
>>>>>
>>>>>> 4) For ultimate speed, a direct comparison should be an option as a
>>>>>> checksum type. Directly comparing the content of the in-memory file
>>>>>> and the target file appears to be twice as fast as an MD5 checksum.
>>>>>> This would not be feasible for a 'source'.
>>>>>
>>>>> That might be faster, but please don't re-introduce the slurp the whole
>>>>> file in memory syndrome.
>>>>
>>>> It seems that MD5 might be doing it anyway. When I tried a block-wise
>>>> 'comp', it was *much* slower and I think it was even slower than MD5 (or
>>>> close anyway) which means that MD5 is reading the whole blob into memory
>>>> to work on it anyway! If we're going to take the memory hit, let's just
>>>> take it and compare the two items.
>>>
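For reference, a block-wise comparison along these lines might look like
the sketch below. It is an illustration only, not the script discussed
above: `same_content?` and the 64 KB chunk size are assumptions. Ruby's
standard library also ships `FileUtils.compare_file`, which does a similar
job.

```ruby
# Hypothetical block-wise comparison: read both files in parallel, one
# chunk at a time, and stop at the first difference.
COMPARE_CHUNK_SIZE = 64 * 1024 # assumed buffer size

def same_content?(path_a, path_b, chunk_size = COMPARE_CHUNK_SIZE)
  # Files of different sizes cannot have identical content.
  return false unless File.size(path_a) == File.size(path_b)

  File.open(path_a, 'rb') do |a|
    File.open(path_b, 'rb') do |b|
      # IO#read returns nil at end of file, which ends the loop.
      while (chunk_a = a.read(chunk_size))
        return false unless chunk_a == b.read(chunk_size)
      end
    end
  end
  true
end
```

Memory use stays bounded by two chunks, and the early exit means files
that differ near the start are rejected without reading the rest.
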
>>> Is this an md5 script you wrote, or are you using the Puppet code?
>>> We've worked to add 'stream' checksum types that checksum the file a
>>> bit at a time.
>>>
>>> I expect that most of those are actually a good bit slower than just
>>> reading the whole thing in and checksumming; the whole-file approach
>>> is faster only by being less RAM-efficient.
>>>
>>>>> That's really something I'd like to work on. Unfortunately this is
>>>>> really complex stuff. The file type is one of the biggest types and even
>>>>> though I already worked on it, I'm not sure I grasped enough to be able
>>>>> to fully refactor it for a different inner working.
>>>>
>>>> Completely agreed. I'll do what I can to help, but my outside time is
>>>> severely limited.
>>>
>>>
>>
>> - --
>> Trevor Vaughan
>> Vice President, Onyx Point, Inc.
>> email: [email protected]
>> phone: 410-541-ONYX (6699)
>> pgp: 0x6C701E94
>>
>> - -- This account not approved for unencrypted sensitive information --
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1.4.11 (GNU/Linux)
>>
>> iQEcBAEBAgAGBQJNGx5LAAoJECNCGV1OLcypUrQH/RjbY56VfBunWk5rV1cgMCSO
>> VMzXjqY0HhyAvOtYpcesYDpvPHNsnSBx3684TCX1+VYfy8vh9lFy6CxEqB3ohwN5
>> gHjIBs2c6ZpT8UloywkwbMwAkFnqFXMfQ2/ELOfGvKsHwWq+Z9uVxW/vPxmswPJ0
>> U6qiDnmk762OfRyD0/sBNsYljnUXwDBidWC9up9WO+hEz9bSr+NLSxMc+5PsVjyl
>> kRtGtnBNqnE8Sw8VEjGKjrHkuoCR9pqAiGU2KM4h827zkog5oy0ghPolnEJXMD82
>> ErEXo+Y6C7xZc7U62+0eS96Zb0LZi9B412c5PpB08TEP18lJwCwSWWtY47dSJgQ=
>> =FPbW
>> -----END PGP SIGNATURE-----
>
>
> --
> On Bureaucracy....
>        The Pythagorean theorem contains 24 words. Archimedes
> Principle, 67.  The Ten Commandments, 179. The American Declaration of
> Independence, 300. And recent legislation in Europe concerning when
> and where to smoke, 23,942.      -- The European, June 23-29, 1995
> ---------------------------------------------------------------------
> Luke Kanies  -|-   http://puppetlabs.com   -|-   +1(615)594-8199
>

-- 
Trevor Vaughan
Vice President, Onyx Point, Inc
(410) 541-6699
[email protected]

-- This account not approved for unencrypted proprietary information --

-- 
You received this message because you are subscribed to the Google Groups 
"Puppet Developers" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/puppet-dev?hl=en.
