On Dec 29, 2010, at 3:41 AM, Trevor Vaughan wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> This was just a comparison script that I wrote.
>
> Checking through the Ruby source, it definitely looks like the digest
> methods just pull the entire "string" into memory.

Yep, that's why we had to go through the effort of adding stream summing.

> If this is a file, then we're already taking the memory hit that you
> would take by just comparing the two files.

Not with Puppet we're not - we do stream summing for both files, and then compare the sums.

> This makes complete sense since Digest doesn't know what you're passing.
>
> I will note that it looks like chunking a file and performing the
> checksum might take twice as long.

This is the big thing - I agree that this could all be much faster via the mechanisms you've proposed. It's just important to know how Puppet works internally right now and why.

The stream summing is very important for us because if Ruby loads 100 MB files into memory, it basically never frees that memory again. Thus, we give up speed for drastically better RAM efficiency, and we don't want to go back to the bad old RAM days.

> I'm thinking that size+time (similar to rsync) might be enough for most
> files on a system. There will be relatively few files on a system that
> you'll want to do a full checksum on.

I agree.

> Thanks,
>
> Trevor
>
> On 12/26/2010 12:40 AM, Luke Kanies wrote:
>> On Dec 23, 2010, at 4:58, Trevor Vaughan <[email protected]> wrote:
>>
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>>
>>> Brice,
>>>
>>> Thanks for the feedback, this is good stuff!
>>>
>>>>
>>>> That's more or less what rsync does. For sourced files we could even use
>>>> HTTP If-Modified-Since and/or If-None-Match to perform the check (and
>>>> thus the check would be done server side).
>>>
>>> Yes, I briefly looked at the Rsync algorithm papers to see if I could
>>> figure out how to re-implement it in Ruby but just using the native
>>> Rsync libraries might be a better call. However, that would introduce an
>>> external dependency.
>>>
>>>>
>>>>> 4) For ultimate speed, a direct comparison should be an option as a
>>>>> checksum type. Directly comparing the content of the in-memory file
>>>>> and the target file appears to be twice as fast as an MD5 checksum.
>>>>> This would not be feasible for a 'source'.
>>>>
>>>> That might be faster, but please don't re-introduce the slurp the whole
>>>> file in memory syndrome.
>>>
>>> It seems that MD5 might be doing it anyway. When I tried a block-wise
>>> 'comp', it was *much* slower and I think it was even slower than MD5 (or
>>> close anyway) which means that MD5 is reading the whole blob into memory
>>> to work on it anyway! If we're going to take the memory hit, let's just
>>> take it and compare the two items.
>>
>> Is this an md5 script you wrote, or are you using the Puppet code?
>> We've worked to add 'stream' checksum types that checksum the file a
>> bit at a time.
>>
>> I expect that most of those are actually a good bit slower than just
>> reading the whole thing in and checksumming; the whole-file approach
>> is faster only by being less RAM-efficient.
>>
>>>> That's really something I'd like to work on. Unfortunately this is
>>>> really complex stuff. The file type is one of the biggest types and even
>>>> though I already worked on it, I'm not sure I grasped enough to be able
>>>> to fully refactor it for a different inner working.
>>>
>>> Completely agreed. I'll do what I can to help, but my outside time is
>>> severely limited.
>>
>>
>
> - --
> Trevor Vaughan
> Vice President, Onyx Point, Inc.
> email: [email protected]
> phone: 410-541-ONYX (6699)
> pgp: 0x6C701E94
>
> - -- This account not approved for unencrypted sensitive information --

--
On Bureaucracy.... The Pythagorean theorem contains 24 words.
Archimedes Principle, 67. The Ten Commandments, 179. The American
Declaration of Independence, 300. And recent legislation in Europe
concerning when and where to smoke, 23,942.
    -- The European, June 23-29, 1995
---------------------------------------------------------------------
Luke Kanies -|- http://puppetlabs.com -|- +1(615)594-8199
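
For anyone following the thread, here is a rough sketch of the trade-off discussed above. It is not Puppet's actual code: the method names (stream_md5, slurp_md5, probably_unchanged?) and the 64 KB chunk size are assumptions made up for illustration. It shows a checksum computed a chunk at a time next to one that reads the whole file first, plus a size+time check in the spirit of the rsync-style comparison Trevor suggested.

```ruby
require 'digest/md5'

# Assumed buffer size for illustration; not taken from Puppet.
CHUNK_SIZE = 64 * 1024

# Hypothetical helper: checksum a file one chunk at a time, so only a
# single chunk is held in memory regardless of the file size.
def stream_md5(path, chunk_size = CHUNK_SIZE)
  digest = Digest::MD5.new
  File.open(path, 'rb') do |file|
    buffer = String.new
    # IO#read(length, outbuf) refills the same buffer and returns nil at EOF.
    digest.update(buffer) while file.read(chunk_size, buffer)
  end
  digest.hexdigest
end

# Hypothetical helper: read the whole file into one string, then checksum it.
# Simpler, but memory use grows with the file size.
def slurp_md5(path)
  Digest::MD5.hexdigest(File.binread(path))
end

# Hypothetical helper: cheap size+time comparison. It only says the files
# look the same; it does not inspect their content.
def probably_unchanged?(path_a, path_b)
  a = File.stat(path_a)
  b = File.stat(path_b)
  a.size == b.size && a.mtime == b.mtime
end
```

Both checksum helpers should return the same hex string for the same file; the difference is only in how much of the file sits in memory at once. The size+time check could be tried first, falling back to a full checksum only for the files where that matters.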
