Luke,

Thanks for the clarification. I think I was looking at the 0.24 codebase so this may have thrown me off a bit.
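
For anyone following along, the stream summing discussed in this thread can be sketched roughly as below. This is only an illustration in plain Ruby, not Puppet's actual code: the `stream_md5` name and the 4096-byte chunk size are assumptions made for the example.

```ruby
require 'digest/md5'

# Hypothetical helper (not Puppet's implementation): compute an MD5 sum
# by feeding the file to the digest one chunk at a time, so only
# chunk_size bytes of file data are held in memory at once.
def stream_md5(path, chunk_size = 4096)
  digest = Digest::MD5.new
  File.open(path, 'rb') do |file|
    # IO#read(length) returns nil at end of file, which ends the loop.
    while (chunk = file.read(chunk_size))
      digest.update(chunk)
    end
  end
  digest.hexdigest
end
```

The result should match `Digest::MD5.hexdigest(File.read(path))` for the same file. The difference is that the whole file never has to sit in a single Ruby string, which is the memory trade-off Luke describes in the quoted text.
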
Could you point me to the Git rev(s) where the stream summing is described?

Thanks!

Trevor

On Wed, Jan 5, 2011 at 2:00 PM, Luke Kanies <[email protected]> wrote:
> On Dec 29, 2010, at 3:41 AM, Trevor Vaughan wrote:
>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> This was just a comparison script that I wrote.
>>
>> Checking through the Ruby source, it definitely looks like the digest
>> methods just pull the entire "string" into memory.
>
> Yep, that's why we had to go through the effort of adding stream summing.
>
>> If this is a file, then we're already taking the memory hit that you
>> would take by just comparing the two files.
>
> Not with Puppet we're not - we do stream summing for both files, and then
> compare the sums.
>
>> This makes complete sense since Digest doesn't know what you're passing.
>>
>> I will note that it looks like chunking a file and performing the
>> checksum might take twice as long.
>
> This is the big thing - I agree that this could all be much faster via the
> mechanisms you've proposed.
>
> It's just important to know how Puppet works internally right now and why.
> The stream summing is very important for us because if Ruby loads 100mb files
> into memory, it basically never frees that memory again. Thus, we give up
> speed for drastically better ram efficiency, and we don't want to go back to
> the bad old ram days.
>
>> I'm thinking that size+time (similar to rsync) might be enough for most
>> files on a system. There will be relatively few files on a system that
>> you'll want to do a full checksum on.
>
> I agree.
>
>> Thanks,
>>
>> Trevor
>>
>> On 12/26/2010 12:40 AM, Luke Kanies wrote:
>>> On Dec 23, 2010, at 4:58, Trevor Vaughan <[email protected]> wrote:
>>>
>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> Hash: SHA1
>>>>
>>>> Brice,
>>>>
>>>> Thanks for the feedback, this is good stuff!
>>>>
>>>>>
>>>>> That's more or less what rsync does. For sourced files we could even use
>>>>> HTTP If-Modified-Since and/or If-None-Match to perform the check (and
>>>>> thus the check would be done server side).
>>>>
>>>> Yes, I briefly looked at the Rsync algorithm papers to see if I could
>>>> figure out how to re-implement it in Ruby but just using the native
>>>> Rsync libraries might be a better call. However, that would introduce an
>>>> external dependency.
>>>>
>>>>>
>>>>>> 4) For ultimate speed, a direct comparison should be an option as a
>>>>>> checksum type. Directly comparing the content of the in-memory file
>>>>>> and the target file appears to be twice as fast as an MD5 checksum.
>>>>>> This would not be feasible for a 'source'.
>>>>>
>>>>> That might be faster, but please don't re-introduce the slurp the whole
>>>>> file in memory syndrom.
>>>>
>>>> It seems that MD5 might be doing it anyway. When I tried a block-wise
>>>> 'comp', it was *much* slower and I think it was even slower than MD5 (or
>>>> close anyway) which means that MD5 is reading the whole blob into memory
>>>> to work on it anyway! If we're going to take the memory hit, let's just
>>>> take it and compare the two items.
>>>
>>> Is this an md5 script you wrote, or are you using the Puppet code?
>>> We've worked to add 'stream' checksum types that checksum the file a
>>> bit at a time.
>>>
>>> I expect that most of those are actually a good bit slower than just
>>> reading the whole thing in and checksumming, but they're faster by
>>> being less ram-efficient.
>>>
>>>>> That's really something I'd like to work on. Unfortunately this is
>>>>> really complex stuff. The file type is one of the biggest type and even
>>>>> though I already worked on it, I'm not sure I grasped enough to be able
>>>>> to fully refactor it for a different inner working.
>>>>
>>>> Completely agreed. I'll do what I can to help, but my outside time is
>>>> severely limited.
>>>
>>>
>>
>> - --
>> Trevor Vaughan
>> Vice President, Onyx Point, Inc.
>> email: [email protected]
>> phone: 410-541-ONYX (6699)
>> pgp: 0x6C701E94
>>
>> - -- This account not approved for unencrypted sensitive information --
>
> --
> On Bureaucracy....
> The Pythagorean theorem contains 24 words. Archimedes
> Principle, 67. The Ten Commandments, 179. The American Declaration of
> Independence, 300. And recent legislation in Europe concerning when
> and where to smoke, 23,942. -- The European, June 23-29, 1995
> ---------------------------------------------------------------------
> Luke Kanies -|- http://puppetlabs.com -|- +1(615)594-8199

--
Trevor Vaughan
Vice President, Onyx Point, Inc
(410) 541-6699
[email protected]

-- This account not approved for unencrypted proprietary information --

--
You received this message because you are subscribed to the Google Groups "Puppet Developers" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/puppet-dev?hl=en.
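
The size+time idea raised in the quoted thread (similar to rsync's quick check) could look something like the sketch below. The `probably_unchanged?` helper is a hypothetical name used only for illustration, and comparing two local paths is an assumption; nothing here reflects how Puppet's file type actually works.

```ruby
# Hypothetical helper: a cheap "quick check" that treats two files as
# unchanged when both their size and modification time match.
# No file content is read, so it is fast but can miss an edit that
# keeps the same size and the same mtime.
def probably_unchanged?(path_a, path_b)
  a = File.stat(path_a)
  b = File.stat(path_b)
  a.size == b.size && a.mtime == b.mtime
end
```

A full (streamed) checksum would then only be needed for the relatively few files where a false "unchanged" result is not acceptable.
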
