Re: [Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Guido van Rossum
On Sat, Dec 31, 2011 at 10:57 PM, Paul McMillan p...@mcmillan.ws wrote:

  Still, it would represent an effort for the attacker of a much greater
  magnitude than the current attack. It's all a trade-off -- at some point
  it'll just be easier for the attacker to use some other vulnerability.
 Also
  the class of vulnerable servers would be greatly reduced.

 I agree that doing anything is better than doing nothing. If we use
 the earlier suggestion and prepend everything with a fixed-length
 seed, we need quite a bit of entropy (and so a fairly long string) to
 make a lasting difference.


Ah, but the effect of that long string is summarized in a single (32- or
64-bit) integer.


   I'm not sure I understand this. What's the worry about a different
 class of
  hash functions? (It may be clear that I do not have a deep mathematical
  understanding of hash functions.)

 This was mostly in reference to earlier suggestions of switching to
 cityhash, or using btrees, or other more invasive changes. Python 2.X
 is pretty stable and making large changes like that to the codebase
 can have unpredictable effects.


Agreed.


 We know that the current hash function
 works well (except for this specific problem), so it seems like the
 best fix will be as minimal a modification as possible, to avoid
 introducing bugs.


Yup.


  I forget -- what do we do on systems without urandom()? (E.g. Windows?)
 Windows has CryptGenRandom which is approximately equivalent.

  Let's worry about timing attacks another time okay?
 Agreed. As long as there isn't a gaping hole, I'm fine with that.

  Hm. I'm not sure I like the idea of extra arithmetic for every character
  being hashed.

 From a performance standpoint, this may still be better than adding 8
 or 10 characters to every single hash operation, since most hashes are
 over short strings.


But how about precomputing the intermediate value (x)? The hash is (mostly)
doing x = f(x, c) for each c in the input.

It is important that this function touches every
 character - if it only interacts with a subset of them, an attacker
 can fix that subset and vary the rest.


I sort of see your point, but I still think that if we could add as little
per-character overhead as possible it would be best -- sometimes people
*do* hash very long strings.


  But I like the idea of a bigger random seed from which we
  deterministically pick some part.

 Yeah. This makes it much harder to attack, since it very solidly
 places the attacker outside the realm of just brute force the key.

  How about just initializing x to some
  subsequence of the seed determined by e.g. the length of the hashed
 string
  plus a few characters from it?

 We did consider this, and if performance is absolutely the prime
 directive, this (or a variant) may be the best option. Unfortunately,
 the collision generator doesn't necessarily vary the length of the
 string. Additionally, if we don't vary based on all the letters in the
 string, an attacker can fix the characters that we do use and generate
 colliding strings around them.


Still, much more work for the attacker.


 Another option to consider would be to apply this change to some but
 not all of the rounds. If we added the seed lookup xor operation for
 only the first and last 5 values of x, we would still retain much of
 the benefit without adding much computational overhead for very long
 strings.


I like that.


 We could also consider a less computationally expensive operation than
 the modulo for calculating the lookup index, like simply truncating to
 the correct number of bits.


Sure. Thanks for thinking about all the details here!!

-- 
--Guido van Rossum (python.org/~guido)
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Guido van Rossum
Different concern. What if someone were to have code implementing an
external, persistent hash table, using Python's hash() function? They might
have a way to rehash everything when a new version of Python comes along,
but they would not be happy if hash() is different in each process. I
somehow vaguely remember possibly having seen such code, or something else
where a bit of random data was needed and hash() was used since it's so
easily available.

-- 
--Guido van Rossum (python.org/~guido)
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Guido van Rossum
PS. Is the collision-generator used in the attack code open source?

-- 
--Guido van Rossum (python.org/~guido)
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Christian Heimes
Am 01.01.2012 16:13, schrieb Guido van Rossum:
 Different concern. What if someone were to have code implementing an
 external, persistent hash table, using Python's hash() function? They
 might have a way to rehash everything when a new version of Python comes
 along, but they would not be happy if hash() is different in each
 process. I somehow vaguely remember possibly having seen such code, or
 something else where a bit of random data was needed and hash() was used
 since it's so easily available.

I had the same concern as you and was worried that projects like ZODB
might require a stable hash function. Fred already stated that ZODB
doesn't use the hash in its btree structures.

Possible solutions:

 * make it possible to provide the seed as an env var

 * disable randomizing as default setting or at least add an option to
disable randomization

IMHO the issue needs a PEP that explains the issue, shows all possible
solutions and describes how we have solved the issue. I'm willing to
start a PEP. Who likes to be the co-author?

Christian
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Christian Heimes
Am 31.12.2011 23:38, schrieb Terry Reedy:
 On 12/31/2011 4:43 PM, PJ Eby wrote:
 
 Here's an idea.  Suppose we add a sys.hash_seed or some such, that's
 settable to an int, and defaults to whatever we're using now.  Then
 programs that want a fix can just set it to a random number,
 
 I do not think we can allow that to change once there are hashed 
 dictionaries existing.

Me, too. Armin suggested to use an env var as random.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Christian Heimes
Am 01.01.2012 00:56, schrieb Guido van Rossum:
 ISTM the only reasonable thing is to have a random seed picked very
 early in the process, to be used to change the hash() function of
 str/bytes/unicode (in a way that they are still compatible with each other).
 
 The seed should be unique per process except it should survive fork()
 (but not exec()). I'm not worried about unrelated processes needing to
 have the same hash(), but I'm not against offering an env variable or
 command line flag to force the seed.

I've created a clone at http://hg.python.org/features/randomhash/ as a
testbed. The code creates the seed very early in PyInitializeEx(). The
method isn't called on fork() but on exec().

 I'm not too concerned about a 3rd party being able to guess the random
 seed -- this would require much more effort on their part, since they
 would have to generate a new set of colliding keys each time they think
 they have guessed the hash (as long as they can't force the seed -- this
 actually argues slightly *against* offering a way to force the seed,
 except that we have strong backwards compatibility requirements).

The talkers claim and have shown that it's too easy to pre-calculate
collisions with hashing algorithms similar to DJBX33X / DJBX33A. It
might be a good idea to change the hashing algorithm, too. Paul as
listed some new algorithms. Ruby 1.9 is using FNV
http://isthe.com/chongo/tech/comp/fnv/ which promises to be fast with a
good dispersion pattern. A hashing algorithm without a
meet-in-the-middle vulnerability would reduce the pressure on a good and
secure seed, too.

 We need to fix this as far back as Python 2.6, and it would be nice if a
 source patch was available that works on Python 2.5 -- personally I do
 have a need for a 2.5 fix and if nobody creates one I will probably end
 up backporting the fix from 2.6 to 2.5.

+1

Should the randomization be disabled on 2.5 to 3.2 by default to reduce
backward compatibility issues?

Christian
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Christian Heimes
Am 01.01.2012 05:11, schrieb Antoine Pitrou:
 On Sat, 31 Dec 2011 16:56:00 -0700
 Guido van Rossum gu...@python.org wrote:
 ISTM the only reasonable thing is to have a random seed picked very early
 in the process, to be used to change the hash() function of
 str/bytes/unicode (in a way that they are still compatible with each other).
 
 Do str and bytes still have to be compatible with each other in 3.x?

py3k has tests for hash(ascii) == hash(bascii). Are you talking
about this invariant?

Christian
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Antoine Pitrou
On Sun, 01 Jan 2012 16:48:32 +0100
Christian Heimes li...@cheimes.de wrote:
 The talkers claim and have shown that it's too easy to pre-calculate
 collisions with hashing algorithms similar to DJBX33X / DJBX33A. It
 might be a good idea to change the hashing algorithm, too. Paul as
 listed some new algorithms. Ruby 1.9 is using FNV
 http://isthe.com/chongo/tech/comp/fnv/ which promises to be fast with a
 good dispersion pattern.

We already seem to be using a FNV-alike, is it just a matter of
changing the parameters?

 A hashing algorithm without a
 meet-in-the-middle vulnerability would reduce the pressure on a good and
 secure seed, too.
 
  We need to fix this as far back as Python 2.6, and it would be nice if a
  source patch was available that works on Python 2.5 -- personally I do
  have a need for a 2.5 fix and if nobody creates one I will probably end
  up backporting the fix from 2.6 to 2.5.
 
 +1
 
 Should the randomization be disabled on 2.5 to 3.2 by default to reduce
 backward compatibility issues?

Isn't 2.5 already EOL'ed?
As for 3.2, I'd say no. I don't know about 2.6 and 2.7.

Regards

Antoine.


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Antoine Pitrou
On Sun, 01 Jan 2012 16:56:19 +0100
Christian Heimes li...@cheimes.de wrote:
 Am 01.01.2012 05:11, schrieb Antoine Pitrou:
  On Sat, 31 Dec 2011 16:56:00 -0700
  Guido van Rossum gu...@python.org wrote:
  ISTM the only reasonable thing is to have a random seed picked very early
  in the process, to be used to change the hash() function of
  str/bytes/unicode (in a way that they are still compatible with each 
  other).
  
  Do str and bytes still have to be compatible with each other in 3.x?
 
 py3k has tests for hash(ascii) == hash(bascii). Are you talking
 about this invariant?

Yes. It doesn't seem to have any point anymore.

Regards

Antoine.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Christian Heimes
Am 01.01.2012 06:57, schrieb Paul McMillan:
 I agree that doing anything is better than doing nothing. If we use
 the earlier suggestion and prepend everything with a fixed-length
 seed, we need quite a bit of entropy (and so a fairly long string) to
 make a lasting difference.

Your code at https://gist.github.com/0a91e52efa74f61858b5 reads about 2
MB (2**21 - 1) data from urandom. I'm worried that this is going to
exhaust the OS's random pool and suck it dry. We shouldn't forget that
Python is used for long running processes as well as short scripts. Your
suggestion also increases the process size by 2 MB which is going to be
an issue for mobile and embedded platforms.

How about this:

r = [ord(i) for i in os.urandom(256)]
rs = os.urandom(4) # or 8 ?
seed = rs[-1] + (rs[-2]  8) + (rs[-3]  16) + (rs[-4]  24)

def _hash_string(s):
The algorithm behind compute_hash() for a string or a unicode.
from pypy.rlib.rarithmetic import intmask
length = len(s)
if length == 0:
return -1
x = intmask(seed + (ord(s[0])  7))
i = 0
while i  length:
o = ord(s[i])
x = intmask((103*x) ^ o ^ r[o % 0xff]
i += 1
x ^= length
return intmask(x)

This combines a random seed for the hash with your suggestion.

We also need to special case short strings. The above routine hands over
the seed to attackers, if he is able to retrieve lots of single
character hashes. The randomization shouldn't be used if we can prove
that it's not possible to create hash collisions for strings shorter
than X. For example 64bit FNV-1 has no collisions for 8 chars or less,
32bit FNV has no collisions for 4 or less cars.

Christian

Christian
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Christian Heimes
Am 01.01.2012 17:09, schrieb Antoine Pitrou:
 On Sun, 01 Jan 2012 16:48:32 +0100
 Christian Heimes li...@cheimes.de wrote:
 The talkers claim and have shown that it's too easy to pre-calculate
 collisions with hashing algorithms similar to DJBX33X / DJBX33A. It
 might be a good idea to change the hashing algorithm, too. Paul as
 listed some new algorithms. Ruby 1.9 is using FNV
 http://isthe.com/chongo/tech/comp/fnv/ which promises to be fast with a
 good dispersion pattern.
 
 We already seem to be using a FNV-alike, is it just a matter of
 changing the parameters?

No, we are using something similar to DJBX33X. FNV is a completely
different type of hash algorithm.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Antoine Pitrou
Le dimanche 01 janvier 2012 à 17:34 +0100, Christian Heimes a écrit :
 Am 01.01.2012 17:09, schrieb Antoine Pitrou:
  On Sun, 01 Jan 2012 16:48:32 +0100
  Christian Heimes li...@cheimes.de wrote:
  The talkers claim and have shown that it's too easy to pre-calculate
  collisions with hashing algorithms similar to DJBX33X / DJBX33A. It
  might be a good idea to change the hashing algorithm, too. Paul as
  listed some new algorithms. Ruby 1.9 is using FNV
  http://isthe.com/chongo/tech/comp/fnv/ which promises to be fast with a
  good dispersion pattern.
  
  We already seem to be using a FNV-alike, is it just a matter of
  changing the parameters?
 
 No, we are using something similar to DJBX33X. FNV is a completely
 different type of hash algorithm.

I don't understand. FNV-1 multiplies the current running result with a
prime and then xors it with the following byte. This is also what we do.
(I'm assuming 103 is prime)

I see two differences:
- FNV starts with a non-zero constant offset basis
- FNV uses a different prime than ours

(as a side note, FNV operates on bytes, but for unicode we must operate
on code points in [0, 1114111]: although arguably the common case is
hashing ASCII substrings (protocol tokens etc.))

Regards

Antoine.


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Christian Heimes
Am 01.01.2012 17:54, schrieb Antoine Pitrou:
 I don't understand. FNV-1 multiplies the current running result with a
 prime and then xors it with the following byte. This is also what we do.
 (I'm assuming 103 is prime)

There must be a major difference somewhere inside the algorithm. The
talk at the CCC conference in Berlin mentions that Ruby 1.9 is not
vulnerable to meet-in-the-middle attacks and Ruby 1.9 uses FNV. The C
code of FNV is more complex than our code, too.

Christian
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Christian Heimes
Am 01.01.2012 01:11, schrieb Guido van Rossum:
 FWIW I managed to build Python 2.6, and a trivial mutation of the
 string/unicode hash function (add 1 initially) made only three tests
 fail; test_symtable and test_json both have a dependency on dictionary
 order, test_ctypes I can't quite figure out what's going on.

In my fork, these tests are failing:

test_dbm test_dis test_gdb test_inspect test_packaging test_set
test_symtable test_urllib test_userdict test_collections
test_json

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Victor Stinner

Le 01/01/2012 04:29, Paul McMillan a écrit :

This is incorrect. Once an attacker has guessed the random seed, any
operation which reveals the ordering of hashed objects can be used to
verify the answer. JSON responses would be ideal. In fact, an attacker
can do a brute-force attack of the random seed offline. Once they have
the seed, generating collisions is a fast process.


If we want to protect a website against this attack for example, we must 
suppose that the attacker can inject arbitrary data and can get 
(indirectly) the result of hash(str) (e.g. with the representation of a 
dict in a traceback, with a JSON output, etc.).



The goal isn't perfection, but we need to do better than a simple
salt.


I disagree. I don't want to break backward compatibility and have a 
hash() function different for each process, if the change is not an 
effective protection against the hash vulnerability.


It's really hard to write a good (secure) hash function: see for example 
the recent NIST competition (started in 2008, will end this year). Even 
good security researcher are unable to write a strong and fast hash 
function. It's easy to add a weakness in the function if you don't have 
a good background in cryptography. The NIST competition gives 4 years to 
analyze new hash functions. We should not rush to add a quick hack if 
it doesn't solve correctly the problem (protect against a collision 
attack and preimage attack).


http://en.wikipedia.org/wiki/NIST_hash_function_competition
http://en.wikipedia.org/wiki/Collision_attack

Runtime performance does matter, I'm not completly sure that changing 
Python is the best place to add a countermeasure against a 
vulnerability. I don't want to slow down numpy for a web vulnerability. 
Because there are different use cases, a better compromise is maybe to 
add a runtime option to use a secure hash function, and keep the unsafe 
but fast hash function by default.



I propose we modify the string hash function like this:

https://gist.github.com/0a91e52efa74f61858b5


Always allocate 2**21 bytes just to workaround one specific kind of 
attack is not acceptable. I suppose that the maximum acceptable is 4096 
bytes (or better 256 bytes).


Crytographic hash functions don't need random data, why would Python 
need 2 MB (!) for its hash function?


Victor
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Terry Reedy

On 1/1/2012 10:13 AM, Guido van Rossum wrote:

PS. Is the collision-generator used in the attack code open source?


As I posted before, Alexander Klink and Julian Wälde gave their project 
email as hash...@alech.de. Since they indicated disappointment in not 
hearing from Python, I presume they would welcome engagement.


--
Terry Jan Reedy


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Terry Reedy

On 1/1/2012 12:28 PM, Christian Heimes wrote:

Am 01.01.2012 17:54, schrieb Antoine Pitrou:

I don't understand. FNV-1 multiplies the current running result with a
prime and then xors it with the following byte. This is also what we do.
(I'm assuming 103 is prime)


There must be a major difference somewhere inside the algorithm. The
talk at the CCC conference in Berlin mentions that Ruby 1.9 is not
vulnerable to meet-in-the-middle attacks and Ruby 1.9 uses FNV. The C
code of FNV is more complex than our code, too.


I understood Alexander Klink and Julian Wälde, hash...@alech.de, as 
saying that they consider that using a random non-zero start value is 
sufficient to make the hash non-vulnerable.


--
Terry Jan Reedy


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


[Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Jim Jewett
Steven D'Aprano (in
http://mail.python.org/pipermail/python-dev/2011-December/115162.html)
wrote:

 By compile-time, do you mean when the byte-code is compilated, i.e. just
 before runtime, rather than a switch when compiling the Python executable from
 source?

No.  I really mean when the C code is initially compiled to produce an
python executable.

The only reason we're worrying about this is that an adversary may
force worst-case performance.  If the python instance isn't a server,
or at least isn't exposed to untrusted clients, then even a single
extra if test is unjustified overhead.  Adding overhead to every
string hash or every dict lookup is bad.

That said, adding some overhead (only) to dict lookups *that already
hit half a dozen consecutive collisions* probably is reasonable,
because that won't happen very often with normal data.  (6 collisions
can't happen at all unless there are already at least 6 entries, so
small dicts are safe; with at least 1/3 of the slots empty, it should
happen only 1/729 for worst-size larger dicts.)

-jJ
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Paul McMillan
 But how about precomputing the intermediate value (x)? The hash is (mostly)
 doing x = f(x, c) for each c in the input.

That's a fair point. If we go down that avenue, I think simply
choosing a random fixed starting value for x is the correct choice,
rather than computing an intermediate value.

 I sort of see your point, but I still think that if we could add as little
 per-character overhead as possible it would be best -- sometimes people *do*
 hash very long strings.

Yep, agreed. My original proposal did not adequately address this.

 Another option to consider would be to apply this change to some but
 not all of the rounds. If we added the seed lookup xor operation for
 only the first and last 5 values of x, we would still retain much of
 the benefit without adding much computational overhead for very long
 strings.

 I like that.

I believe this is a reasonable solution. An attacker could still
manipulate the internal state of long strings, but the additional
information at both ends should make that difficult to exploit. I'll
talk it over with the reviewers.

 We could also consider a less computationally expensive operation than
 the modulo for calculating the lookup index, like simply truncating to
 the correct number of bits.

 Sure. Thanks for thinking about all the details here!!

Again, I'll talk to the reviewers (and run the randomness test
battery) to be double-check that this doesn't affect the distribution
in some unexpected way, but I think it should be fine.

 PS. Is the collision-generator used in the attack code open source?

Not in working form, and they've turned down requests for it from
other projects that want to check their work. If it's imperative that
we have one, I can write one, but I'd rather not spend the effort if
we don't need it.

-Paul
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Paul McMillan
 Different concern. What if someone were to have code implementing an
 external, persistent hash table, using Python's hash() function? They might
 have a way to rehash everything when a new version of Python comes along,
 but they would not be happy if hash() is different in each process. I
 somehow vaguely remember possibly having seen such code, or something else
 where a bit of random data was needed and hash() was used since it's so
 easily available.

I agree that there are use cases for allowing users to choose the
random seed, in much the same way it's helpful to be able to set it
for the random number generator. This should probably be something
that can be passed in at runtime. This feature would also be useful
for users who want to synchronize the hashes of multiple independent
processes, for whatever reason. For the general case though,
randomization should be on by default.

-Paul
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


[Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Jim Jewett
Paul McMillan in
http://mail.python.org/pipermail/python-dev/2012-January/115183.html
wrote:

 Guido van Rossum wrote:
 Hm. I'm not sure I like the idea of extra arithmetic for every character
 being hashed.

 the collision generator doesn't necessarily vary the length of the
 string. Additionally, if we don't vary based on all the letters in the
 string, an attacker can fix the characters that we do use and generate
 colliding strings around them.

If the new hash algorithm doesn't kick in before, say, 32 characters,
then most currently hashed strings will not be affected.  And if the
attacker has to add 32 characters to every key, it reduces the this
can be done with only N bytes uploaded risk.  (The same logic
would apply to even longer prefixes, except that an attacker might
more easily find short-enough strings that collide.)

 We could also consider a less computationally expensive operation
 than the modulo for calculating the lookup index, like simply truncating
 to the correct number of bits.

Given that the modulo is always 2^N, how is that different?

-jJ
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


[Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Jim Jewett
Victor Stinner wrote in
http://mail.python.org/pipermail/python-dev/2012-January/115198.html

 If we want to protect a website against this attack for example, we must
 suppose that the attacker can inject arbitrary data and can get
 (indirectly) the result of hash(str) (e.g. with the representation of a
 dict in a traceback, with a JSON output, etc.).

(1)  Is it common to hash non-string input?  Because generating integers
that collide for certain dict sizes is pretty easy...

(2)  Would it make sense for traceback printing to sort dict keys?  (Any site
worried about this issue should already be hiding tracebacks from untrusted
clients, but the cost of this extra protection may be pretty small, given that
tracebacks shouldn't be printed all that often in the first place.)

(3)  Should the docs for json.encoder.JSONEncoder suggest sort_keys=True?

-jJ
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


[Python-Dev] http://mail.python.org/pipermail/python-dev/2011-December/115172.html

2012-01-01 Thread Jim Jewett
In http://mail.python.org/pipermail/python-dev/2011-December/115172.html,
P. J. Eby wrote:

 On Sat, Dec 31, 2011 at 7:03 AM, Stephen J. Turnbull stephen at xemacs.org 
 wrote:

 While the dictionary probe has to start with a hash for backward
 compatibility reasons, is there a reason the overflow strategy for
 insertion has to be buckets containing lists?  How about
 double-hashing, etc?

 This won't help, because the keys still have the same hash value. ANYTHING
 you do to them after they're generated will result in them still colliding.

 The *only* thing that works is to change the hash function in such a way
 that the strings end up with different hashes in the first place.
 Otherwise, you'll still end up with (deliberate) collisions.

Well, there is nothing wrong with switching to a different hash function after N
collisions, rather than in the first place.  The perturbation
effectively does by
shoving the high-order bits through the part of the hash that survives the mask.

 (Well, technically, you could use trees or some other O log n data
 structure as a fallback once you have too many collisions, for some value
 of too many.  Seems a bit wasteful for the purpose, though.)

Your WSGI specification  http://www.python.org/dev/peps/pep-0333/  requires
using a real dictionary for compatibility; storing some of the values
outside the
values array would violate that.  Do you consider that obsolete?

-jJ
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] http://mail.python.org/pipermail/python-dev/2011-December/115172.html

2012-01-01 Thread Christian Heimes
Am 02.01.2012 01:37, schrieb Jim Jewett:
 Well, there is nothing wrong with switching to a different hash function 
 after N
 collisions, rather than in the first place.  The perturbation
 effectively does by
 shoving the high-order bits through the part of the hash that survives the 
 mask.

Except that it won't work or slow down every lookup of missing keys?
It's absolutely crucial that the lookup time is kept as fast as possible.

You can't just change the hash algorithm in the middle of the work
without a speed impact on lookups. The size of the dict can shrink or
grow over time. This results into a different number of collisions for
the same string. Cuckoo hashing
(http://en.wikipedia.org/wiki/Hash_table#Collision_resolution) doesn't
sound feasible for us because it slows down lookup and requires an ABI
incompatible change for more hash slots on str/bytes/unicode instances.

Christian

PS: Something is wrong with your email client. Every of your replies
starts a new thread for me.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] http://mail.python.org/pipermail/python-dev/2011-December/115172.html

2012-01-01 Thread Antoine Pitrou
On Mon, 02 Jan 2012 02:04:38 +0100
Christian Heimes li...@cheimes.de wrote:
 
 PS: Something is wrong with your email client. Every of your replies
 starts a new thread for me.

Same here.

Regards

Antoine.


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] http://mail.python.org/pipermail/python-dev/2011-December/115172.html

2012-01-01 Thread Jim Jewett
On Sun, Jan 1, 2012 at 8:04 PM, Christian Heimes li...@cheimes.de wrote:
 Am 02.01.2012 01:37, schrieb Jim Jewett:
 Well, there is nothing wrong with switching to a different hash function 
 after N
 collisions, rather than in the first place.  The perturbation
 effectively does by
 shoving the high-order bits through the part of the hash that survives the 
 mask.

 Except that it won't work or slow down every lookup of missing keys?
 It's absolutely crucial that the lookup time is kept as fast as possible.

It will only slow down missing keys that themselves hit more than N collisions.

Or were you assuming that I meant to switch the whole table, rather
than just that one key?  I agree that wouldn't work.

 You can't just change the hash algorithm in the middle of the work
 without a speed impact on lookups.

Right -- but there is nothing wrong with modifying the lookdict (and
insert_clean) functions to do something different after the Nth
collision than they did after the N-1th.

-jJ
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] http://mail.python.org/pipermail/python-dev/2011-December/115172.html

2012-01-01 Thread PJ Eby
On Sun, Jan 1, 2012 at 7:37 PM, Jim Jewett jimjjew...@gmail.com wrote:

 Well, there is nothing wrong with switching to a different hash function
 after N
 collisions, rather than in the first place.  The perturbation
 effectively does by
 shoving the high-order bits through the part of the hash that survives the
 mask.


Since these are true hash collisions, they will all have the same high
order bits.  So, the usefulness of the perturbation is limited mainly to
the common case where true collisions are rare.


 (Well, technically, you could use trees or some other O log n data
  structure as a fallback once you have too many collisions, for some value
  of too many.  Seems a bit wasteful for the purpose, though.)

 Your WSGI specification  http://www.python.org/dev/peps/pep-0333/ 
 requires
 using a real dictionary for compatibility; storing some of the values
 outside the
 values array would violate that.


When I said use some other data structure, I was referring to the
internal implementation of the dict type, not to user code.  The only
user-visible difference (even at C API level) would be the order of keys()
et al.  (In any case, I still assume this is too costly an implementation
change compared to changing the hash function or seeding it.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] http://mail.python.org/pipermail/python-dev/2011-December/115172.html

2012-01-01 Thread Jim Jewett
On Sun, Jan 1, 2012 at 10:00 PM, PJ Eby p...@telecommunity.com wrote:
 On Sun, Jan 1, 2012 at 7:37 PM, Jim Jewett jimjjew...@gmail.com wrote:

 Well, there is nothing wrong with switching to a different hash function
 after N
 collisions, rather than in the first place.  The perturbation
 effectively does by
 shoving the high-order bits through the part of the hash that survives the
 mask.

 Since these are true hash collisions, they will all have the same high order
 bits.  So, the usefulness of the perturbation is limited mainly to the
 common case where true collisions are rare.

That is only because the perturb is based solely on the hash.
Switching to an entirely new hash after the 5th collision (for a given
lookup) would resolve that (after the 5th collision); the question is
whether or not the cost is worthwhile.

  (Well, technically, you could use trees or some other O log n data
  structure as a fallback once you have too many collisions, for some
  value
  of too many.  Seems a bit wasteful for the purpose, though.)

 Your WSGI specification  http://www.python.org/dev/peps/pep-0333/ 
 requires
 using a real dictionary for compatibility; storing some of the values
 outside the
 values array would violate that.

 When I said use some other data structure, I was referring to the internal
 implementation of the dict type, not to user code.  The only user-visible
 difference (even at C API level) would be the order of keys() et al.

Given the wording requiring a real dictionary, I would have assumed
that it was OK (if perhaps not sensible) to do pointer arithmetic and
access the keys/values/hashes directly.  (Though if the breakage was
between python versions, I would feel guilty about griping too
loudly.)

-jJ
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


[Python-Dev] PEP 7 clarification request: braces

2012-01-01 Thread Nick Coghlan
I've been having an occasional argument with Benjamin regarding braces
in 4-line if statements:

  if (cond)
statement;
  else
statement;

vs.

  if (cond) {
statement;
  } else {
statement;
  }

He keeps leaving them out, I occasionally tell him they should always
be included (most recently this came up when we gave conflicting
advice to a patch contributor). He says what he's doing is OK, because
he doesn't consider the example in PEP 7 as explicitly disallowing it,
I think it's a recipe for future maintenance hassles when someone adds
a second statement to one of the clauses but doesn't add the braces.
(The only time I consider it reasonable to leave out the braces is for
one liner if statements, where there's no else clause at all)

Since Benjamin doesn't accept the current brace example in PEP 7 as
normative for the case above, I'm bringing it up here to seek
clarification.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 7 clarification request: braces

2012-01-01 Thread Ben Finney
Nick Coghlan ncogh...@gmail.com writes:

 He keeps leaving [braces] out [when the block is a single statement],
 I occasionally tell him they should always be included (most recently
 this came up when we gave conflicting advice to a patch contributor).

As someone who has maintained his fair share of C code, I am firmly on
the side of unconditionally (!) enclosing C statement blocks in braces
regardless of how many statements they contain.

 He says what he's doing is OK, because he doesn't consider the example
 in PEP 7 as explicitly disallowing it

I wonder if he has a stronger argument in favour of his position,
because “it's not forbidden” doesn't imply “it's okay”.

 I think it's a recipe for future maintenance hassles when someone adds
 a second statement to one of the clauses but doesn't add the braces.

Agreed, it's an issue of code maintainability. Which is enough of a
problem in C code that a low-cost improvement like this should always be
done.

But, as someone who carries no water in the Python developer community,
my opinion has no more force than the arguments, and I can't impose it
on anyone. Take it for what it's worth.

-- 
 \ “God was invented to explain mystery. God is always invented to |
  `\ explain those things that you do not understand.” —Richard P. |
_o__)Feynman, 1988 |
Ben Finney

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 7 clarification request: braces

2012-01-01 Thread Scott Dial
On 1/1/2012 11:44 PM, Nick Coghlan wrote:
 I think it's a recipe for future maintenance hassles when someone adds
 a second statement to one of the clauses but doesn't add the braces.
 (The only time I consider it reasonable to leave out the braces is for
 one liner if statements, where there's no else clause at all)

Could you explain how these two cases differ with regard to maintenance?

In either case, there are superfluous edits required if the original
author had used braces *always*. Putting a brace on one-liners adds only
a single line to the code -- just like in the if/else case. So, your
argument seems conflicted. Surely, you would think this is a simpler
edit to make and diff to see in a patch file:

 if(cond) {
   stmt1;
+  stmt2;
 }

vs.

-if(cond)
+if(cond) {
   stmt1;
+  stmt2;
+}

Also, the superfluous edits will wrongly attribute the blame for the
conditional to the wrong author.

-- 
Scott Dial
sc...@scottdial.com
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Paul McMillan
I fixed a couple things in my proposed algorithm:
https://gist.github.com/0a91e52efa74f61858b5

I had a typo, and used 21 instead of 12 for the size multiplier. We
definitely don't need 2MB random data.

The initialization of r was broken. Now it is an array of ints, so
there's no conversion when it's used. I've adjusted it so there's 8k
of random data, broken into 2048 ints.

I added a length-based seed to the initial value of x. This prevents
single-characters from being used to enumerate raw values from r.
This is similar to the change proposed by Christian Heimes.

Most importantly, I moved the xor with r[x % len_r] down a line.
Before, it wasn't being applied to the last character.

 Christian Heimes said:
 We also need to special case short strings. The above routine hands over
 the seed to attackers, if he is able to retrieve lots of single
 character hashes.

The updated code always includes at least 2 lookups from r, which I
believe solves the single-character enumeration problem. If we
special-case part of our hash function for short strings, we may get
suboptimal collisions between the two types of hashes.

I think Ruby uses FNV-1 with a salt, making it less vulnerable to
this. FNV is otherwise similar to our existing hash function.

For the record, cryptographically strong hash functions are in the
neighborhood of 400% slower than our existing hash function.

 Terry Reedy said:
 I understood Alexander Klink and Julian Wälde, hash...@alech.de, as saying
 that they consider that using a random non-zero start value is sufficient to
 make the hash non-vulnerable.

I've been talking to them. They're happy to look at our proposed
changes. They indicate that a non-zero start value is sufficient to
prevent the attack, but D. J. Bernstein disagrees with them. He also
has indicated a willingness to look at our solution.

-Paul
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 7 clarification request: braces

2012-01-01 Thread Nick Coghlan
On Mon, Jan 2, 2012 at 3:04 PM, Scott Dial
scott+python-...@scottdial.com wrote:
 On 1/1/2012 11:44 PM, Nick Coghlan wrote:
 I think it's a recipe for future maintenance hassles when someone adds
 a second statement to one of the clauses but doesn't add the braces.
 (The only time I consider it reasonable to leave out the braces is for
 one liner if statements, where there's no else clause at all)

 Could you explain how these two cases differ with regard to maintenance?

Sure: always including KR style braces for compound statements (even
when they aren't technically necessary) means that indentation ==
control flow, just like Python. Indent your additions correctly, and
the reader and compiler will agree on what they mean:

if (cond) {
statement;
} else {
statement;
addition;  /* Reader and compiler agree this is part of the else clause */
}

if (cond)
statement;
else
statement;
addition;  /* Uh-oh, should have added braces */

I've been trying to convince Benjamin that there's a reason always
include the braces is accepted wisdom amongst many veteran C
programmers (with some allowing an exception for one-liners), but he
isn't believing me, and I'm not going to go through and edit every
single one of his commits to add them.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


[Python-Dev] That depends on what the meaning of is is (was Re: http://mail.python.org/pipermail/python-dev/2011-December/115172.html)

2012-01-01 Thread PJ Eby
On Sun, Jan 1, 2012 at 10:28 PM, Jim Jewett jimjjew...@gmail.com wrote:

 Given the wording requiring a real dictionary, I would have assumed
 that it was OK (if perhaps not sensible) to do pointer arithmetic and
 access the keys/values/hashes directly.  (Though if the breakage was
 between python versions, I would feel guilty about griping too
 loudly.)


If you're going to be a language lawyer about it, I would simply point out
that all the spec requires is that type(env) is dict -- it says nothing
about how Python defines type or is or dict.  So, you're on your own
with that one. ;-)
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 7 clarification request: braces

2012-01-01 Thread Ron Adam
On Mon, 2012-01-02 at 14:44 +1000, Nick Coghlan wrote:
 I've been having an occasional argument with Benjamin regarding braces
 in 4-line if statements:
 
   if (cond)
 statement;
   else
 statement;
 
 vs.
 
   if (cond) {
 statement;
   } else {
 statement;
   }
 
 He keeps leaving them out, I occasionally tell him they should always
 be included (most recently this came up when we gave conflicting
 advice to a patch contributor). He says what he's doing is OK, because
 he doesn't consider the example in PEP 7 as explicitly disallowing it,

 I think it's a recipe for future maintenance hassles when someone adds
 a second statement to one of the clauses but doesn't add the braces.

I've had to correct my self on this one a few times so I will have to
agree it's a good practice.

 (The only time I consider it reasonable to leave out the braces is for
 one liner if statements, where there's no else clause at all)

The problem is only when an additional statement is added to the last
block, not the preceding ones, as the compiler will complain about
those.  So I don't know how the 4 line example without braces is any
worse than a 2 line if without braces.

I think my preference is, if any block in a multi-block expression needs
braces, then the other blocks should have it.  (Including the last block
even if it's a single line.)

The next level up would be to require them on all blocks, even two line
if expressions, but I'm not sure that is really needed.  At some point
the extra noise of the braces makes things harder to read rather than
easier, and what you gain in preventing one type of error may increase
chances of another type of error not being noticed.

Cheers,
   Ron



 Since Benjamin doesn't accept the current brace example in PEP 7 as
 normative for the case above, I'm bringing it up here to seek
 clarification.


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com