[Haskell-cafe] Re: [Haskell] Fingerprints and hashing
Hello Simon, Thursday, October 11, 2007, 1:42:31 PM, you wrote: For various applications (including identifying common sub-expressions, and version tracking in GHC), I'd like a Haskell library that supports simple fingerprint operations. lots of hash-related links was collected at http://www.encode.ru/forums/index.php?action=vthreadforum=1topic=413 -- Best regards, Bulatmailto:[EMAIL PROTECTED] ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell] Fingerprints and hashing
Simon Peyton Jones writes: We are all familiar with the idea of an MD5 checksum, which provides a reliable fingerprint for a file, usually 128 bits or so... For various applications (including identifying common sub-expressions, and version tracking in GHC), I'd like a Haskell library that supports simple fingerprint operations... Below I give an informal specification of such a library. I can think of any number of implementations, but I'd like one that is (a) robust and (b) efficient. By robust I meant that with, say, a 128-bit fingerprint the chance of accidental collision is truly low; a non-robust implementation uses the 128-bit space in a non-uniform way [trivial example: gets stuck on zero]... The module signature would be something like this: data Fingerprint initialPrint :: Fingerprint class Fingerprintable t where fingerPrint :: t - Fingerprint - FingerPrint instance Fingerprintable Char instance Fingerprintable Int instance Fingerprintable Word32 -- etc The fingerPrint operation here is expressed in the traditional way: an accumulator (say 128 bits) into which one stuffs successive values (e.g. a byte stream). However, very important, I also want a more compositional form: instance Fingerprintable Fingerprint so that I can do something like fp :: Expr - Fingerprint fp (App e1 e2) = fp e1 `fingerPrint` fp e2 [For Salil anad Michael: apply the fingerPrint function to a fingerprint] An obvious implementation of instance Fingerprintable Fingerprint is to divide one fingerprint into words, and feed it into the other via the Word32 instance. But it's not clear to me whether or not that is robust... I'd like one other thing: to be able to trade cost for accuracy. I'd like a 128-bit fingerprint with essentially zero chance of having two distinct values that map to the same fingerprint; and a one-word fingerprint that was a lot more efficient, but where collisions would be entirely possible (again hash-consing is an example). Simon, If I wanted this functionality, I'd call in the professionals (like Salil Vadhan, Michael Rabin, and Andrei Broder, whom I have cc'ed on this message). You sent this mail to the Haskell list, but I think the Haskell content of this problem is trivial compared to the algorithmic problems you pose. I remember, for example, a problem in the distant past of SRC Modula-3, where fingerprinting did not behave as expected because the source of the fingerprint module was itself fingerprinted. Here are some algorithmic references: http://tinyurl.com/yoeusg, http://tinyurl.com/2a2yfj - Modula-3 code for compositional 64-bit fingerprints, polynomial chosen by (literally) flipping a coin http://citeseer.ist.psu.edu/broder93some.html - Andrei's paper on applications of fingerprinting How does laziness play into this problem? If you don't mind building up a lot of thunks, you can probably just use several fingerprinting algorithms in parallel and then pull just the one you need... Norman ___ Haskell mailing list Haskell@haskell.org http://www.haskell.org/mailman/listinfo/haskell
Re: [Haskell] Fingerprints and hashing
Neil Mitchell wrote: Hi Simon, We are all familiar with the idea of an MD5 checksum, which provides a reliable fingerprint for a file, usually 128 bits or so. If the file changes, the fingerprint is (almost) certain to do so too. There are lots of techniques: CRC, shar?, MD5, etc. I believe the basic operations are all in the Crypto library: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/Crypto-3.0.3 - see Data.Digest.SHA1 and Data.Digest.MD5 - digest is simply another word for fingerprint in this sense. The Crypto library does indeed provide those operations, but the implementations are naive and hence extremely slow. Fine for hashing an individual password, but not appropriate for ten thousand identifiers in a compiler. b ___ Haskell mailing list Haskell@haskell.org http://www.haskell.org/mailman/listinfo/haskell
[Haskell] Fingerprints and hashing
Dear Haskellers Here's a challenge. We are all familiar with the idea of an MD5 checksum, which provides a reliable fingerprint for a file, usually 128 bits or so. If the file changes, the fingerprint is (almost) certain to do so too. There are lots of techniques: CRC, shar?, MD5, etc. For various applications (including identifying common sub-expressions, and version tracking in GHC), I'd like a Haskell library that supports simple fingerprint operations. Below I give an informal specification of such a library. I can think of any number of implementations, but I'd like one that is (a) robust and (b) efficient. By robust I meant that with, say, a 128-bit fingerprint the chance of accidental collision is truly low; a non-robust implementation uses the 128-bit space in a non-uniform way [trivial example: gets stuck on zero]. Any takers? I bet some of you have this already, in some form. Simon The module signature would be something like this: data Fingerprint instance Eq Fingerprint instance Ord Fingerprint initialPrint :: Fingerprint class Fingerprintable t where fingerPrint :: t - Fingerprint - FingerPrint instance Fingerprintable Char instance Fingerprintable Int instance Fingerprintable Word32 -- etc The fingerPrint operation here is expressed in the traditional way: an accumulator (say 128 bits) into which one stuffs successive values (e.g. a byte stream). However, very important, I also want a more compositional form: instance Fingerprintable Fingerprint so that I can do something like fp :: Expr - Fingerprint fp (App e1 e2) = fp e1 `fingerPrint` fp e2 Yes, I could thread the accumulator, thus: fp' :: Expr - Fingerprint - Fingerprint fp' (App e1 e2) p = fp' e1 (fp' e2 p) but I can't *always* do that; or even if I can, I may already *have* a fingerprint for e1 in my hand, and it's painful to re-traverse e1. Think of hash-consing. An obvious implementation of instance Fingerprintable Fingerprint is to divide one fingerprint into words, and feed it into the other via the Word32 instance. But it's not clear to me whether or not that is robust. (Incidentally, one could define the class thus: class Fingerprintable t where fingerPrint :: t - FingerPrint plus plusFP :: Fingerprint - Fingerprint - Fingerprint But I think it'd be less efficient if that was the only way, because of all those now-compulsory plusFP operations.) I'd like one other thing: to be able to trade cost for accuracy. I'd like a 128-bit fingerprint with essentially zero chance of having two distinct values that map to the same fingerprint; and a one-word fingerprint that was a lot more efficient, but where collisions would be entirely possible (again hash-consing is an example). ___ Haskell mailing list Haskell@haskell.org http://www.haskell.org/mailman/listinfo/haskell
Re: [Haskell] Fingerprints and hashing
Hi Simon, We are all familiar with the idea of an MD5 checksum, which provides a reliable fingerprint for a file, usually 128 bits or so. If the file changes, the fingerprint is (almost) certain to do so too. There are lots of techniques: CRC, shar?, MD5, etc. I believe the basic operations are all in the Crypto library: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/Crypto-3.0.3 - see Data.Digest.SHA1 and Data.Digest.MD5 - digest is simply another word for fingerprint in this sense. However, your fingerprint stuff sounds a lot more like a request for a Hash function - rather than something that operates over streams of bytes. To get both, I'd recommend something like taking the digest of the data after calling show, or after serialising it to a ByteString with the binary library. Doing it this way means you have no additional need for a FingerPrint class but can reuse the existing Show/Binary class. Thanks Neil ___ Haskell mailing list Haskell@haskell.org http://www.haskell.org/mailman/listinfo/haskell
Re: [Haskell] Fingerprints and hashing
If compositionality is important, at least Rabin's fingerprints are worth considering: http://citeseer.ist.psu.edu/broder93some.html They have the neat property that the fingerprint of a concatenation of strings can be cheaply computed from the fingerprints of the constituents. I think this effectively means that the plusFP operation can be made associative, which might offer optimization opportunities. I remember implementing it some seven years ago when prototyping for a C implementation. The code is lost, though. I can give it a shot again, if this is considered a reasonable algorithm. Lauri ___ Haskell mailing list Haskell@haskell.org http://www.haskell.org/mailman/listinfo/haskell
RE: [Haskell] Fingerprints and hashing
Interesting! The associativity property is the kind of thing I was after. Except that I don't really care if FP(as ++ bs) = FP(as) `plusFP` FP(bs). I only care that the latter is robust in the sense of having low probabilty of collision. So the Rabin scheme does more than I really need (which is probably fine). On page 1 he draws a distinction between fingerprinting and hashing, which relates to my last wish. Does one need a different technique for the two, or is it enough to reduce the number of bits in the Rabin scheme, and thereby get a hashing scheme? Strangely the paper gives no comparison with other fingerprinting schemes. Simon | -Original Message- | From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Lauri Alanko | Sent: 11 October 2007 11:27 | To: haskell | Subject: Re: [Haskell] Fingerprints and hashing | | If compositionality is important, at least Rabin's fingerprints are | worth considering: http://citeseer.ist.psu.edu/broder93some.html | | They have the neat property that the fingerprint of a concatenation of | strings can be cheaply computed from the fingerprints of the | constituents. I think this effectively means that the plusFP operation | can be made associative, which might offer optimization opportunities. | | I remember implementing it some seven years ago when prototyping for a | C implementation. The code is lost, though. I can give it a shot | again, if this is considered a reasonable algorithm. | | | Lauri | ___ | Haskell mailing list | Haskell@haskell.org | http://www.haskell.org/mailman/listinfo/haskell ___ Haskell mailing list Haskell@haskell.org http://www.haskell.org/mailman/listinfo/haskell
RE: [Haskell] Fingerprints and hashing
On Thu, 2007-10-11 at 12:28 +0100, Simon Peyton-Jones wrote: Interesting! The associativity property is the kind of thing I was after. Except that I don't really care if FP(as ++ bs) = FP(as) `plusFP` FP(bs). I only care that the latter is robust in the sense of having low probabilty of collision. So the Rabin scheme does more than I really need (which is probably fine). On page 1 he draws a distinction between fingerprinting and hashing, which relates to my last wish. Does one need a different technique for the two, or is it enough to reduce the number of bits in the Rabin scheme, and thereby get a hashing scheme? Strangely the paper gives no comparison with other fingerprinting schemes. Simon | From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Lauri Alanko | | If compositionality is important, at least Rabin's fingerprints are | worth considering: http://citeseer.ist.psu.edu/broder93some.html Note that Rabin's fingerprint algorithm is the same as CRC (http://en.wikipedia.org/wiki/Cyclic_redundancy_check). I'm not aware of people using CRCs for hashing inside a program, but they should work fine and be fairly fast. I can think of two caveats. 1) With CRC and a fixed polynomial, it is very easy for an adversary to pick multiple inputs that give the same fingerprint/hash code. (Rabin avoids this by picking a random polynomial after the adversary has selected the inputs.) Depending on the application, this may be very important, or it may not matter at all. 2) The straightforward approach for instance Fingerprintable Fingerprint works very poorly for CRCs. Considering balanced binary trees with four leaves, and the obvious compositional hash function, then for all a,b,c,d we would have fingerprint ((a,b),(c,d)) == fingerprint ((a,c),(b,d)), and fingerprint ((a,b),(b,d)) == fingerprint ((a,c),(c,d)). To avoid this, the fingerprint would have to include not only the remainder, but the number of bits involved in creating the fingerprint. Then fingerprinting a fingerprint would require time logarithmic in this number of bits. Carl Witty ___ Haskell mailing list Haskell@haskell.org http://www.haskell.org/mailman/listinfo/haskell
Re: [Haskell] Fingerprints and hashing
One example of such an minusFP (not recommended) is (foldr xor 0): Obviously I meant that FP = foldr xor 0. minusFP would be a simple unfolding of this. Dan Weston wrote: I am zero training in cryptography, but I would think that if in addition to FP(as ++ bs) = FP(bs) `plusFPFlipped` FP(as) (I think the flipped plusFP def is more intuitive) there also existed a minusFP for all f and x such that FP(bs) = FP(as ++ bs) `minusFP` FP(as) then that might facilitate common subexpression refactorization in the merging of two hashes of memoized expressions. One example of such an minusFP (not recommended) is (foldr xor 0): That xor is its own inverse is no advantage. But xor is associative and (xor 0) is an identity, so FP([] ++ as) = FP(as ++ []) = FP(as) as expected. There must be more robust such monoidal functors out there. Refactorization could be limited to respect substructure boundaries reflected in the serialization. Dan Weston Simon Peyton-Jones wrote: Interesting! The associativity property is the kind of thing I was after. Except that I don't really care if FP(as ++ bs) = FP(as) `plusFP` FP(bs). I only care that the latter is robust in the sense of having low probabilty of collision. So the Rabin scheme does more than I really need (which is probably fine). On page 1 he draws a distinction between fingerprinting and hashing, which relates to my last wish. Does one need a different technique for the two, or is it enough to reduce the number of bits in the Rabin scheme, and thereby get a hashing scheme? Strangely the paper gives no comparison with other fingerprinting schemes. Simon | -Original Message- | From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Lauri Alanko | Sent: 11 October 2007 11:27 | To: haskell | Subject: Re: [Haskell] Fingerprints and hashing | | If compositionality is important, at least Rabin's fingerprints are | worth considering: http://citeseer.ist.psu.edu/broder93some.html | | They have the neat property that the fingerprint of a concatenation of | strings can be cheaply computed from the fingerprints of the | constituents. I think this effectively means that the plusFP operation | can be made associative, which might offer optimization opportunities. | | I remember implementing it some seven years ago when prototyping for a | C implementation. The code is lost, though. I can give it a shot | again, if this is considered a reasonable algorithm. | | | Lauri | ___ | Haskell mailing list | Haskell@haskell.org | http://www.haskell.org/mailman/listinfo/haskell ___ Haskell mailing list Haskell@haskell.org http://www.haskell.org/mailman/listinfo/haskell ___ Haskell mailing list Haskell@haskell.org http://www.haskell.org/mailman/listinfo/haskell
Re: [Haskell] Fingerprints and hashing
I am zero training in cryptography, but I would think that if in addition to FP(as ++ bs) = FP(bs) `plusFPFlipped` FP(as) (I think the flipped plusFP def is more intuitive) there also existed a minusFP for all f and x such that FP(bs) = FP(as ++ bs) `minusFP` FP(as) then that might facilitate common subexpression refactorization in the merging of two hashes of memoized expressions. One example of such an minusFP (not recommended) is (foldr xor 0): That xor is its own inverse is no advantage. But xor is associative and (xor 0) is an identity, so FP([] ++ as) = FP(as ++ []) = FP(as) as expected. There must be more robust such monoidal functors out there. Refactorization could be limited to respect substructure boundaries reflected in the serialization. Dan Weston Simon Peyton-Jones wrote: Interesting! The associativity property is the kind of thing I was after. Except that I don't really care if FP(as ++ bs) = FP(as) `plusFP` FP(bs). I only care that the latter is robust in the sense of having low probabilty of collision. So the Rabin scheme does more than I really need (which is probably fine). On page 1 he draws a distinction between fingerprinting and hashing, which relates to my last wish. Does one need a different technique for the two, or is it enough to reduce the number of bits in the Rabin scheme, and thereby get a hashing scheme? Strangely the paper gives no comparison with other fingerprinting schemes. Simon | -Original Message- | From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Lauri Alanko | Sent: 11 October 2007 11:27 | To: haskell | Subject: Re: [Haskell] Fingerprints and hashing | | If compositionality is important, at least Rabin's fingerprints are | worth considering: http://citeseer.ist.psu.edu/broder93some.html | | They have the neat property that the fingerprint of a concatenation of | strings can be cheaply computed from the fingerprints of the | constituents. I think this effectively means that the plusFP operation | can be made associative, which might offer optimization opportunities. | | I remember implementing it some seven years ago when prototyping for a | C implementation. The code is lost, though. I can give it a shot | again, if this is considered a reasonable algorithm. | | | Lauri | ___ | Haskell mailing list | Haskell@haskell.org | http://www.haskell.org/mailman/listinfo/haskell ___ Haskell mailing list Haskell@haskell.org http://www.haskell.org/mailman/listinfo/haskell ___ Haskell mailing list Haskell@haskell.org http://www.haskell.org/mailman/listinfo/haskell