[Haskell-cafe] Re: [Haskell] Fingerprints and hashing

2007-10-19 Thread Bulat Ziganshin
Hello Simon,

Thursday, October 11, 2007, 1:42:31 PM, you wrote:

 For various applications (including identifying common
 sub-expressions, and version tracking in GHC), I'd like a Haskell
 library that supports simple fingerprint operations.

lots of hash-related links was collected at

http://www.encode.ru/forums/index.php?action=vthreadforum=1topic=413



-- 
Best regards,
 Bulatmailto:[EMAIL PROTECTED]

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell] Fingerprints and hashing

2007-10-15 Thread Norman Ramsey
Simon Peyton Jones writes:

  We are all familiar with the idea of an MD5 checksum, which provides a
  reliable fingerprint for a file, usually 128 bits or so...

  For various applications (including identifying common sub-expressions, and
  version tracking in GHC), I'd like a Haskell library that supports simple
  fingerprint operations...
  
  Below I give an informal specification of such a library.  I can think of
  any number of implementations, but I'd like one that is (a) robust and (b)
  efficient.  By robust I meant that with, say, a 128-bit fingerprint the
  chance of accidental collision is truly low; a non-robust implementation
  uses the 128-bit space in a non-uniform way [trivial example: gets stuck on
  zero]...

  The module signature would be something like this:
  
data Fingerprint
initialPrint :: Fingerprint
  
class Fingerprintable t where
  fingerPrint :: t - Fingerprint - FingerPrint
  
instance Fingerprintable Char
instance Fingerprintable Int
instance Fingerprintable Word32
-- etc
  
  The fingerPrint operation here is expressed in the traditional way: an
  accumulator (say 128 bits) into which one stuffs successive values (e.g. a
  byte stream).
  
  However, very important, I also want a more compositional form:
  
instance Fingerprintable Fingerprint
  
  so that I can do something like
  
fp :: Expr - Fingerprint
fp (App e1 e2) = fp e1 `fingerPrint` fp e2

[For Salil anad Michael: apply the fingerPrint function to a fingerprint]

  An obvious implementation of instance Fingerprintable Fingerprint is to
  divide one fingerprint into words, and feed it into the other via the
  Word32 instance.  But it's not clear to me whether or not that is robust...
  
  I'd like one other thing: to be able to trade cost for accuracy.  I'd like
  a 128-bit fingerprint with essentially zero chance of having two distinct
  values that map to the same fingerprint; and a one-word fingerprint that
  was a lot more efficient, but where collisions would be entirely possible
  (again hash-consing is an example).

Simon,

If I wanted this functionality, I'd call in the professionals (like
Salil Vadhan, Michael Rabin, and Andrei Broder, whom I have cc'ed on
this message).  You sent this mail to the Haskell list, but I think
the Haskell content of this problem is trivial compared to the
algorithmic problems you pose.  I remember, for example, a problem in
the distant past of SRC Modula-3, where fingerprinting did not behave
as expected because the source of the fingerprint module was itself
fingerprinted.

Here are some algorithmic references:

  http://tinyurl.com/yoeusg, http://tinyurl.com/2a2yfj
- Modula-3 code for compositional 64-bit fingerprints,
  polynomial chosen by (literally) flipping a coin

  http://citeseer.ist.psu.edu/broder93some.html
- Andrei's paper on applications of fingerprinting

How does laziness play into this problem?  If you don't mind building
up a lot of thunks, you can probably just use several fingerprinting
algorithms in parallel and then pull just the one you need...


Norman


___
Haskell mailing list
Haskell@haskell.org
http://www.haskell.org/mailman/listinfo/haskell


Re: [Haskell] Fingerprints and hashing

2007-10-12 Thread Bryan O'Sullivan

Neil Mitchell wrote:

Hi Simon,


We are all familiar with the idea of an MD5 checksum, which provides a reliable 
fingerprint for a file, usually 128 bits or so.  If the file changes, the 
fingerprint is (almost) certain to do so too.  There are lots of techniques: CRC, shar?, 
MD5, etc.


I believe the basic operations are all in the Crypto library:
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/Crypto-3.0.3
- see Data.Digest.SHA1 and Data.Digest.MD5 - digest is simply another
word for fingerprint in this sense.


The Crypto library does indeed provide those operations, but the 
implementations are naive and hence extremely slow.  Fine for hashing an 
individual password, but not appropriate for ten thousand identifiers in 
a compiler.


b
___
Haskell mailing list
Haskell@haskell.org
http://www.haskell.org/mailman/listinfo/haskell


[Haskell] Fingerprints and hashing

2007-10-11 Thread Simon Peyton-Jones
Dear Haskellers

Here's a challenge.

We are all familiar with the idea of an MD5 checksum, which provides a reliable 
fingerprint for a file, usually 128 bits or so.  If the file changes, the 
fingerprint is (almost) certain to do so too.  There are lots of techniques: 
CRC, shar?, MD5, etc.

For various applications (including identifying common sub-expressions, and 
version tracking in GHC), I'd like a Haskell library that supports simple 
fingerprint operations.

Below I give an informal specification of such a library.  I can think of any 
number of implementations, but I'd like one that is (a) robust and (b) 
efficient.  By robust I meant that with, say, a 128-bit fingerprint the 
chance of accidental collision is truly low; a non-robust implementation uses 
the 128-bit space in a non-uniform way [trivial example: gets stuck on zero].

Any takers?  I bet some of you have this already, in some form.

Simon



The module signature would be something like this:

  data Fingerprint
  instance Eq Fingerprint
  instance Ord Fingerprint

  initialPrint :: Fingerprint

  class Fingerprintable t where
fingerPrint :: t - Fingerprint - FingerPrint

  instance Fingerprintable Char
  instance Fingerprintable Int
  instance Fingerprintable Word32
  -- etc

The fingerPrint operation here is expressed in the traditional way: an 
accumulator (say 128 bits) into which one stuffs successive values (e.g. a byte 
stream).

However, very important, I also want a more compositional form:

  instance Fingerprintable Fingerprint

so that I can do something like

  fp :: Expr - Fingerprint
  fp (App e1 e2) = fp e1 `fingerPrint` fp e2

Yes, I could thread the accumulator, thus:

  fp' :: Expr - Fingerprint - Fingerprint
  fp' (App e1 e2) p = fp' e1 (fp' e2 p)

but I can't *always* do that; or even if I can, I may already *have* a 
fingerprint for e1 in my hand, and it's painful to re-traverse e1.  Think of 
hash-consing.

An obvious implementation of instance Fingerprintable Fingerprint is to 
divide one fingerprint into words, and feed it into the other via the Word32 
instance.  But it's not clear to me whether or not that is robust.


(Incidentally, one could define the class thus:
  class Fingerprintable t where
fingerPrint :: t - FingerPrint
plus
  plusFP :: Fingerprint - Fingerprint - Fingerprint
But I think it'd be less efficient if that was the only way, because of all 
those now-compulsory plusFP operations.)


I'd like one other thing: to be able to trade cost for accuracy.  I'd like a 
128-bit fingerprint with essentially zero chance of having two distinct values 
that map to the same fingerprint; and a one-word fingerprint that was a lot 
more efficient, but where collisions would be entirely possible (again 
hash-consing is an example).



___
Haskell mailing list
Haskell@haskell.org
http://www.haskell.org/mailman/listinfo/haskell


Re: [Haskell] Fingerprints and hashing

2007-10-11 Thread Neil Mitchell
Hi Simon,

 We are all familiar with the idea of an MD5 checksum, which provides a 
 reliable fingerprint for a file, usually 128 bits or so.  If the file 
 changes, the fingerprint is (almost) certain to do so too.  There are lots of 
 techniques: CRC, shar?, MD5, etc.

I believe the basic operations are all in the Crypto library:
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/Crypto-3.0.3
- see Data.Digest.SHA1 and Data.Digest.MD5 - digest is simply another
word for fingerprint in this sense.

However, your fingerprint stuff sounds a lot more like a request for a
Hash function - rather than something that operates over streams of
bytes. To get both, I'd recommend something like taking the digest of
the data after calling show, or after serialising it to a ByteString
with the binary library. Doing it this way means you have no
additional need for a FingerPrint class but can reuse the existing
Show/Binary class.

Thanks

Neil
___
Haskell mailing list
Haskell@haskell.org
http://www.haskell.org/mailman/listinfo/haskell


Re: [Haskell] Fingerprints and hashing

2007-10-11 Thread Lauri Alanko
If compositionality is important, at least Rabin's fingerprints are
worth considering: http://citeseer.ist.psu.edu/broder93some.html

They have the neat property that the fingerprint of a concatenation of
strings can be cheaply computed from the fingerprints of the
constituents. I think this effectively means that the plusFP operation
can be made associative, which might offer optimization opportunities.

I remember implementing it some seven years ago when prototyping for a
C implementation. The code is lost, though. I can give it a shot
again, if this is considered a reasonable algorithm.


Lauri
___
Haskell mailing list
Haskell@haskell.org
http://www.haskell.org/mailman/listinfo/haskell


RE: [Haskell] Fingerprints and hashing

2007-10-11 Thread Simon Peyton-Jones
Interesting! The associativity property is the kind of thing I was after.  
Except that I don't really care if FP(as ++ bs) = FP(as) `plusFP` FP(bs).  I 
only care that the latter is robust in the sense of having low probabilty of 
collision.  So the Rabin scheme does more than I really need (which is probably 
fine).

On page 1 he draws a distinction between fingerprinting and hashing, which 
relates to my last wish.  Does one need a different technique for the two, or 
is it enough to reduce the number of bits in the Rabin scheme, and thereby get 
a hashing scheme?

Strangely the paper gives no comparison with other fingerprinting schemes.

Simon

| -Original Message-
| From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Lauri Alanko
| Sent: 11 October 2007 11:27
| To: haskell
| Subject: Re: [Haskell] Fingerprints and hashing
|
| If compositionality is important, at least Rabin's fingerprints are
| worth considering: http://citeseer.ist.psu.edu/broder93some.html
|
| They have the neat property that the fingerprint of a concatenation of
| strings can be cheaply computed from the fingerprints of the
| constituents. I think this effectively means that the plusFP operation
| can be made associative, which might offer optimization opportunities.
|
| I remember implementing it some seven years ago when prototyping for a
| C implementation. The code is lost, though. I can give it a shot
| again, if this is considered a reasonable algorithm.
|
|
| Lauri
| ___
| Haskell mailing list
| Haskell@haskell.org
| http://www.haskell.org/mailman/listinfo/haskell
___
Haskell mailing list
Haskell@haskell.org
http://www.haskell.org/mailman/listinfo/haskell


RE: [Haskell] Fingerprints and hashing

2007-10-11 Thread Carl Witty
On Thu, 2007-10-11 at 12:28 +0100, Simon Peyton-Jones wrote:
 Interesting! The associativity property is the kind of thing I was after.  
 Except that I don't really care if FP(as ++ bs) = FP(as) `plusFP` FP(bs).  I 
 only care that the latter is robust in the sense of having low probabilty of 
 collision.  So the Rabin scheme does more than I really need (which is 
 probably fine).
 
 On page 1 he draws a distinction between fingerprinting and hashing, which 
 relates to my last wish.  Does one need a different technique for the two, or 
 is it enough to reduce the number of bits in the Rabin scheme, and thereby 
 get a hashing scheme?
 
 Strangely the paper gives no comparison with other fingerprinting schemes.
 
 Simon
 | From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Lauri Alanko
 |
 | If compositionality is important, at least Rabin's fingerprints are
 | worth considering: http://citeseer.ist.psu.edu/broder93some.html

Note that Rabin's fingerprint algorithm is the same as CRC
(http://en.wikipedia.org/wiki/Cyclic_redundancy_check).

I'm not aware of people using CRCs for hashing inside a program, but
they should work fine and be fairly fast.

I can think of two caveats.  

1) With CRC and a fixed polynomial, it is very easy for an adversary
to pick multiple inputs that give the same fingerprint/hash code.
(Rabin avoids this by picking a random polynomial after the adversary
has selected the inputs.)  Depending on the application, this may be
very important, or it may not matter at all.

2) The straightforward approach for instance Fingerprintable
Fingerprint works very poorly for CRCs.  Considering balanced binary
trees with four leaves, and the obvious compositional hash function,
then for all a,b,c,d we would have
fingerprint ((a,b),(c,d)) == fingerprint ((a,c),(b,d)),
and
fingerprint ((a,b),(b,d)) == fingerprint ((a,c),(c,d)).

To avoid this, the fingerprint would have to include not only the
remainder, but the number of bits involved in creating the fingerprint.
Then fingerprinting a fingerprint would require time logarithmic in this
number of bits.

Carl Witty


___
Haskell mailing list
Haskell@haskell.org
http://www.haskell.org/mailman/listinfo/haskell


Re: [Haskell] Fingerprints and hashing

2007-10-11 Thread Dan Weston

 One example of such an minusFP (not recommended) is (foldr xor 0):

Obviously I meant that FP = foldr xor 0. minusFP would be a simple 
unfolding of this.


Dan Weston wrote:
I am zero training in cryptography, but I would think that if in 
addition to


  FP(as ++ bs) = FP(bs) `plusFPFlipped` FP(as)

(I think the flipped plusFP def is more intuitive)

there also existed a minusFP for all f and x such that

  FP(bs) = FP(as ++ bs) `minusFP` FP(as)

then that might facilitate common subexpression refactorization in the 
merging of two hashes of memoized expressions.


One example of such an minusFP (not recommended) is (foldr xor 0):

That xor is its own inverse is no advantage. But xor is associative and 
(xor 0) is an identity, so FP([] ++ as) = FP(as ++ []) = FP(as) as 
expected. There must be more robust such monoidal functors out there.


Refactorization could be limited to respect substructure boundaries 
reflected in the serialization.


Dan Weston

Simon Peyton-Jones wrote:
Interesting! The associativity property is the kind of thing I was 
after.  Except that I don't really care if FP(as ++ bs) = FP(as) 
`plusFP` FP(bs).  I only care that the latter is robust in the sense 
of having low probabilty of collision.  So the Rabin scheme does more 
than I really need (which is probably fine).


On page 1 he draws a distinction between fingerprinting and hashing, 
which relates to my last wish.  Does one need a different technique 
for the two, or is it enough to reduce the number of bits in the Rabin 
scheme, and thereby get a hashing scheme?


Strangely the paper gives no comparison with other fingerprinting 
schemes.


Simon

| -Original Message-
| From: [EMAIL PROTECTED] 
[mailto:[EMAIL PROTECTED] On Behalf Of Lauri Alanko

| Sent: 11 October 2007 11:27
| To: haskell
| Subject: Re: [Haskell] Fingerprints and hashing
|
| If compositionality is important, at least Rabin's fingerprints are
| worth considering: http://citeseer.ist.psu.edu/broder93some.html
|
| They have the neat property that the fingerprint of a concatenation of
| strings can be cheaply computed from the fingerprints of the
| constituents. I think this effectively means that the plusFP operation
| can be made associative, which might offer optimization opportunities.
|
| I remember implementing it some seven years ago when prototyping for a
| C implementation. The code is lost, though. I can give it a shot
| again, if this is considered a reasonable algorithm.
|
|
| Lauri
| ___
| Haskell mailing list
| Haskell@haskell.org
| http://www.haskell.org/mailman/listinfo/haskell
___
Haskell mailing list
Haskell@haskell.org
http://www.haskell.org/mailman/listinfo/haskell








___
Haskell mailing list
Haskell@haskell.org
http://www.haskell.org/mailman/listinfo/haskell


Re: [Haskell] Fingerprints and hashing

2007-10-11 Thread Dan Weston
I am zero training in cryptography, but I would think that if in 
addition to


  FP(as ++ bs) = FP(bs) `plusFPFlipped` FP(as)

(I think the flipped plusFP def is more intuitive)

there also existed a minusFP for all f and x such that

  FP(bs) = FP(as ++ bs) `minusFP` FP(as)

then that might facilitate common subexpression refactorization in the 
merging of two hashes of memoized expressions.


One example of such an minusFP (not recommended) is (foldr xor 0):

That xor is its own inverse is no advantage. But xor is associative and 
(xor 0) is an identity, so FP([] ++ as) = FP(as ++ []) = FP(as) as 
expected. There must be more robust such monoidal functors out there.


Refactorization could be limited to respect substructure boundaries 
reflected in the serialization.


Dan Weston

Simon Peyton-Jones wrote:

Interesting! The associativity property is the kind of thing I was after.  
Except that I don't really care if FP(as ++ bs) = FP(as) `plusFP` FP(bs).  I 
only care that the latter is robust in the sense of having low probabilty of 
collision.  So the Rabin scheme does more than I really need (which is probably 
fine).

On page 1 he draws a distinction between fingerprinting and hashing, which 
relates to my last wish.  Does one need a different technique for the two, or 
is it enough to reduce the number of bits in the Rabin scheme, and thereby get 
a hashing scheme?

Strangely the paper gives no comparison with other fingerprinting schemes.

Simon

| -Original Message-
| From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Lauri Alanko
| Sent: 11 October 2007 11:27
| To: haskell
| Subject: Re: [Haskell] Fingerprints and hashing
|
| If compositionality is important, at least Rabin's fingerprints are
| worth considering: http://citeseer.ist.psu.edu/broder93some.html
|
| They have the neat property that the fingerprint of a concatenation of
| strings can be cheaply computed from the fingerprints of the
| constituents. I think this effectively means that the plusFP operation
| can be made associative, which might offer optimization opportunities.
|
| I remember implementing it some seven years ago when prototyping for a
| C implementation. The code is lost, though. I can give it a shot
| again, if this is considered a reasonable algorithm.
|
|
| Lauri
| ___
| Haskell mailing list
| Haskell@haskell.org
| http://www.haskell.org/mailman/listinfo/haskell
___
Haskell mailing list
Haskell@haskell.org
http://www.haskell.org/mailman/listinfo/haskell





___
Haskell mailing list
Haskell@haskell.org
http://www.haskell.org/mailman/listinfo/haskell