On 2010-10-18 03:44, Justin Lebar wrote: > So it appears that 13% of my CPU time is spent computing md4 hashes, > while another 25% is spent in hash_source_code_string but outside the > MD4 code. > > To someone new to the code like me, it appears that there's some room > for optimization here.
Indeed. (This was by the way mentioned on the list a couple of weeks ago: http://www.mail-archive.com/ccache@lists.samba.org/msg00532.html) > [...] > If all we really care about is finding the strings "__DATE__" and > "__TIME__", there are faster algorithms than a character-by-character > search. Yes, the current algorithm isn't particularly good, but it was easy enough to implement quickly when the direct mode was more or less just a proof of concept. Also, I wanted as few false negatives as possible, which is why I decided to care about being inside or outside strings. > (Note also that the current implementation copies the whole file into > hashbuf one character at a time. Again, do the benefits of stripping > out comments really offset this?) While I think that stripping out comments is nice, I agree that it has to be weighed against the cost, and the same thing for the "false negatives in comments" thing. So I guess my answer is "it depends on the gain". Do you want to work on this? That would be awesome! I currently don't have much ccache time, and when I get some, I would like to work on other things first. > * Why does ccache still use MD4? Surely there's a better / faster hash > out there. I noticed that ccache includes murmurhash, but it doesn't > seem like it's used in too many places. There's probably a good reason > for this, but it's not too apparent to me. > [...] I added murmurhash to get a good hash function for some hash tables I introduced when implementing the direct mode, which is a relatively late invention. I'm far from an expert on hash functions, but surely murmurhash and similar functions (that are designed to be good hash functions for a hash table) aren't suitable to use for identifying (without verification) arbitrary content like a cryptograhic hash function is. Even the 64-bit version of murmurhash has way too high collision rate. MD4 has been there from the start and neither Tridge or I have seen any reason to switch it. MD5, SHA1 and other even more modern cryptograhic hash functions are indeed stronger but also slower, and the increased resistance against various crypto attacks doesn't seem necessary in a tool like ccache. That said, I'm sure there nowadays may exist hash functions that are both better (i.e., with lower collision rate) AND faster than MD4. Do you (or anyone else) know of any with properties that would be a good fit for ccache? -- Joel _______________________________________________ ccache mailing list ccache@lists.samba.org https://lists.samba.org/mailman/listinfo/ccache