Re: [Python-Dev] Hash collision security issue (now public)
On Sat, Dec 31, 2011 at 10:57 PM, Paul McMillan p...@mcmillan.ws wrote: Still, it would represent an effort for the attacker of a much greater magnitude than the current attack. It's all a trade-off -- at some point it'll just be easier for the attacker to use some other vulnerability. Also the class of vulnerable servers would be greatly reduced. I agree that doing anything is better than doing nothing. If we use the earlier suggestion and prepend everything with a fixed-length seed, we need quite a bit of entropy (and so a fairly long string) to make a lasting difference. Ah, but the effect of that long string is summarized in a single (32- or 64-bit) integer. I'm not sure I understand this. What's the worry about a different class of hash functions? (It may be clear that I do not have a deep mathematical understanding of hash functions.) This was mostly in reference to earlier suggestions of switching to cityhash, or using btrees, or other more invasive changes. Python 2.X is pretty stable and making large changes like that to the codebase can have unpredictable effects. Agreed. We know that the current hash function works well (except for this specific problem), so it seems like the best fix will be as minimal a modification as possible, to avoid introducing bugs. Yup. I forget -- what do we do on systems without urandom()? (E.g. Windows?) Windows has CryptGenRandom which is approximately equivalent. Let's worry about timing attacks another time okay? Agreed. As long as there isn't a gaping hole, I'm fine with that. Hm. I'm not sure I like the idea of extra arithmetic for every character being hashed. From a performance standpoint, this may still be better than adding 8 or 10 characters to every single hash operation, since most hashes are over short strings. But how about precomputing the intermediate value (x)? The hash is (mostly) doing x = f(x, c) for each c in the input. It is important that this function touches every character - if it only interacts with a subset of them, an attacker can fix that subset and vary the rest. I sort of see your point, but I still think that if we could add as little per-character overhead as possible it would be best -- sometimes people *do* hash very long strings. But I like the idea of a bigger random seed from which we deterministically pick some part. Yeah. This makes it much harder to attack, since it very solidly places the attacker outside the realm of just brute force the key. How about just initializing x to some subsequence of the seed determined by e.g. the length of the hashed string plus a few characters from it? We did consider this, and if performance is absolutely the prime directive, this (or a variant) may be the best option. Unfortunately, the collision generator doesn't necessarily vary the length of the string. Additionally, if we don't vary based on all the letters in the string, an attacker can fix the characters that we do use and generate colliding strings around them. Still, much more work for the attacker. Another option to consider would be to apply this change to some but not all of the rounds. If we added the seed lookup xor operation for only the first and last 5 values of x, we would still retain much of the benefit without adding much computational overhead for very long strings. I like that. We could also consider a less computationally expensive operation than the modulo for calculating the lookup index, like simply truncating to the correct number of bits. Sure. Thanks for thinking about all the details here!! -- --Guido van Rossum (python.org/~guido) ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
Different concern. What if someone were to have code implementing an external, persistent hash table, using Python's hash() function? They might have a way to rehash everything when a new version of Python comes along, but they would not be happy if hash() is different in each process. I somehow vaguely remember possibly having seen such code, or something else where a bit of random data was needed and hash() was used since it's so easily available. -- --Guido van Rossum (python.org/~guido) ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
PS. Is the collision-generator used in the attack code open source? -- --Guido van Rossum (python.org/~guido) ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
Am 01.01.2012 16:13, schrieb Guido van Rossum: Different concern. What if someone were to have code implementing an external, persistent hash table, using Python's hash() function? They might have a way to rehash everything when a new version of Python comes along, but they would not be happy if hash() is different in each process. I somehow vaguely remember possibly having seen such code, or something else where a bit of random data was needed and hash() was used since it's so easily available. I had the same concern as you and was worried that projects like ZODB might require a stable hash function. Fred already stated that ZODB doesn't use the hash in its btree structures. Possible solutions: * make it possible to provide the seed as an env var * disable randomizing as default setting or at least add an option to disable randomization IMHO the issue needs a PEP that explains the issue, shows all possible solutions and describes how we have solved the issue. I'm willing to start a PEP. Who likes to be the co-author? Christian ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
Am 31.12.2011 23:38, schrieb Terry Reedy: On 12/31/2011 4:43 PM, PJ Eby wrote: Here's an idea. Suppose we add a sys.hash_seed or some such, that's settable to an int, and defaults to whatever we're using now. Then programs that want a fix can just set it to a random number, I do not think we can allow that to change once there are hashed dictionaries existing. Me, too. Armin suggested to use an env var as random. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
Am 01.01.2012 00:56, schrieb Guido van Rossum: ISTM the only reasonable thing is to have a random seed picked very early in the process, to be used to change the hash() function of str/bytes/unicode (in a way that they are still compatible with each other). The seed should be unique per process except it should survive fork() (but not exec()). I'm not worried about unrelated processes needing to have the same hash(), but I'm not against offering an env variable or command line flag to force the seed. I've created a clone at http://hg.python.org/features/randomhash/ as a testbed. The code creates the seed very early in PyInitializeEx(). The method isn't called on fork() but on exec(). I'm not too concerned about a 3rd party being able to guess the random seed -- this would require much more effort on their part, since they would have to generate a new set of colliding keys each time they think they have guessed the hash (as long as they can't force the seed -- this actually argues slightly *against* offering a way to force the seed, except that we have strong backwards compatibility requirements). The talkers claim and have shown that it's too easy to pre-calculate collisions with hashing algorithms similar to DJBX33X / DJBX33A. It might be a good idea to change the hashing algorithm, too. Paul as listed some new algorithms. Ruby 1.9 is using FNV http://isthe.com/chongo/tech/comp/fnv/ which promises to be fast with a good dispersion pattern. A hashing algorithm without a meet-in-the-middle vulnerability would reduce the pressure on a good and secure seed, too. We need to fix this as far back as Python 2.6, and it would be nice if a source patch was available that works on Python 2.5 -- personally I do have a need for a 2.5 fix and if nobody creates one I will probably end up backporting the fix from 2.6 to 2.5. +1 Should the randomization be disabled on 2.5 to 3.2 by default to reduce backward compatibility issues? Christian ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
Am 01.01.2012 05:11, schrieb Antoine Pitrou: On Sat, 31 Dec 2011 16:56:00 -0700 Guido van Rossum gu...@python.org wrote: ISTM the only reasonable thing is to have a random seed picked very early in the process, to be used to change the hash() function of str/bytes/unicode (in a way that they are still compatible with each other). Do str and bytes still have to be compatible with each other in 3.x? py3k has tests for hash(ascii) == hash(bascii). Are you talking about this invariant? Christian ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
On Sun, 01 Jan 2012 16:48:32 +0100 Christian Heimes li...@cheimes.de wrote: The talkers claim and have shown that it's too easy to pre-calculate collisions with hashing algorithms similar to DJBX33X / DJBX33A. It might be a good idea to change the hashing algorithm, too. Paul as listed some new algorithms. Ruby 1.9 is using FNV http://isthe.com/chongo/tech/comp/fnv/ which promises to be fast with a good dispersion pattern. We already seem to be using a FNV-alike, is it just a matter of changing the parameters? A hashing algorithm without a meet-in-the-middle vulnerability would reduce the pressure on a good and secure seed, too. We need to fix this as far back as Python 2.6, and it would be nice if a source patch was available that works on Python 2.5 -- personally I do have a need for a 2.5 fix and if nobody creates one I will probably end up backporting the fix from 2.6 to 2.5. +1 Should the randomization be disabled on 2.5 to 3.2 by default to reduce backward compatibility issues? Isn't 2.5 already EOL'ed? As for 3.2, I'd say no. I don't know about 2.6 and 2.7. Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
On Sun, 01 Jan 2012 16:56:19 +0100 Christian Heimes li...@cheimes.de wrote: Am 01.01.2012 05:11, schrieb Antoine Pitrou: On Sat, 31 Dec 2011 16:56:00 -0700 Guido van Rossum gu...@python.org wrote: ISTM the only reasonable thing is to have a random seed picked very early in the process, to be used to change the hash() function of str/bytes/unicode (in a way that they are still compatible with each other). Do str and bytes still have to be compatible with each other in 3.x? py3k has tests for hash(ascii) == hash(bascii). Are you talking about this invariant? Yes. It doesn't seem to have any point anymore. Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
Am 01.01.2012 06:57, schrieb Paul McMillan: I agree that doing anything is better than doing nothing. If we use the earlier suggestion and prepend everything with a fixed-length seed, we need quite a bit of entropy (and so a fairly long string) to make a lasting difference. Your code at https://gist.github.com/0a91e52efa74f61858b5 reads about 2 MB (2**21 - 1) data from urandom. I'm worried that this is going to exhaust the OS's random pool and suck it dry. We shouldn't forget that Python is used for long running processes as well as short scripts. Your suggestion also increases the process size by 2 MB which is going to be an issue for mobile and embedded platforms. How about this: r = [ord(i) for i in os.urandom(256)] rs = os.urandom(4) # or 8 ? seed = rs[-1] + (rs[-2] 8) + (rs[-3] 16) + (rs[-4] 24) def _hash_string(s): The algorithm behind compute_hash() for a string or a unicode. from pypy.rlib.rarithmetic import intmask length = len(s) if length == 0: return -1 x = intmask(seed + (ord(s[0]) 7)) i = 0 while i length: o = ord(s[i]) x = intmask((103*x) ^ o ^ r[o % 0xff] i += 1 x ^= length return intmask(x) This combines a random seed for the hash with your suggestion. We also need to special case short strings. The above routine hands over the seed to attackers, if he is able to retrieve lots of single character hashes. The randomization shouldn't be used if we can prove that it's not possible to create hash collisions for strings shorter than X. For example 64bit FNV-1 has no collisions for 8 chars or less, 32bit FNV has no collisions for 4 or less cars. Christian Christian ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
Am 01.01.2012 17:09, schrieb Antoine Pitrou: On Sun, 01 Jan 2012 16:48:32 +0100 Christian Heimes li...@cheimes.de wrote: The talkers claim and have shown that it's too easy to pre-calculate collisions with hashing algorithms similar to DJBX33X / DJBX33A. It might be a good idea to change the hashing algorithm, too. Paul as listed some new algorithms. Ruby 1.9 is using FNV http://isthe.com/chongo/tech/comp/fnv/ which promises to be fast with a good dispersion pattern. We already seem to be using a FNV-alike, is it just a matter of changing the parameters? No, we are using something similar to DJBX33X. FNV is a completely different type of hash algorithm. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
Le dimanche 01 janvier 2012 à 17:34 +0100, Christian Heimes a écrit : Am 01.01.2012 17:09, schrieb Antoine Pitrou: On Sun, 01 Jan 2012 16:48:32 +0100 Christian Heimes li...@cheimes.de wrote: The talkers claim and have shown that it's too easy to pre-calculate collisions with hashing algorithms similar to DJBX33X / DJBX33A. It might be a good idea to change the hashing algorithm, too. Paul as listed some new algorithms. Ruby 1.9 is using FNV http://isthe.com/chongo/tech/comp/fnv/ which promises to be fast with a good dispersion pattern. We already seem to be using a FNV-alike, is it just a matter of changing the parameters? No, we are using something similar to DJBX33X. FNV is a completely different type of hash algorithm. I don't understand. FNV-1 multiplies the current running result with a prime and then xors it with the following byte. This is also what we do. (I'm assuming 103 is prime) I see two differences: - FNV starts with a non-zero constant offset basis - FNV uses a different prime than ours (as a side note, FNV operates on bytes, but for unicode we must operate on code points in [0, 1114111]: although arguably the common case is hashing ASCII substrings (protocol tokens etc.)) Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
Am 01.01.2012 17:54, schrieb Antoine Pitrou: I don't understand. FNV-1 multiplies the current running result with a prime and then xors it with the following byte. This is also what we do. (I'm assuming 103 is prime) There must be a major difference somewhere inside the algorithm. The talk at the CCC conference in Berlin mentions that Ruby 1.9 is not vulnerable to meet-in-the-middle attacks and Ruby 1.9 uses FNV. The C code of FNV is more complex than our code, too. Christian ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
Am 01.01.2012 01:11, schrieb Guido van Rossum: FWIW I managed to build Python 2.6, and a trivial mutation of the string/unicode hash function (add 1 initially) made only three tests fail; test_symtable and test_json both have a dependency on dictionary order, test_ctypes I can't quite figure out what's going on. In my fork, these tests are failing: test_dbm test_dis test_gdb test_inspect test_packaging test_set test_symtable test_urllib test_userdict test_collections test_json ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
Le 01/01/2012 04:29, Paul McMillan a écrit : This is incorrect. Once an attacker has guessed the random seed, any operation which reveals the ordering of hashed objects can be used to verify the answer. JSON responses would be ideal. In fact, an attacker can do a brute-force attack of the random seed offline. Once they have the seed, generating collisions is a fast process. If we want to protect a website against this attack for example, we must suppose that the attacker can inject arbitrary data and can get (indirectly) the result of hash(str) (e.g. with the representation of a dict in a traceback, with a JSON output, etc.). The goal isn't perfection, but we need to do better than a simple salt. I disagree. I don't want to break backward compatibility and have a hash() function different for each process, if the change is not an effective protection against the hash vulnerability. It's really hard to write a good (secure) hash function: see for example the recent NIST competition (started in 2008, will end this year). Even good security researcher are unable to write a strong and fast hash function. It's easy to add a weakness in the function if you don't have a good background in cryptography. The NIST competition gives 4 years to analyze new hash functions. We should not rush to add a quick hack if it doesn't solve correctly the problem (protect against a collision attack and preimage attack). http://en.wikipedia.org/wiki/NIST_hash_function_competition http://en.wikipedia.org/wiki/Collision_attack Runtime performance does matter, I'm not completly sure that changing Python is the best place to add a countermeasure against a vulnerability. I don't want to slow down numpy for a web vulnerability. Because there are different use cases, a better compromise is maybe to add a runtime option to use a secure hash function, and keep the unsafe but fast hash function by default. I propose we modify the string hash function like this: https://gist.github.com/0a91e52efa74f61858b5 Always allocate 2**21 bytes just to workaround one specific kind of attack is not acceptable. I suppose that the maximum acceptable is 4096 bytes (or better 256 bytes). Crytographic hash functions don't need random data, why would Python need 2 MB (!) for its hash function? Victor ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
On 1/1/2012 10:13 AM, Guido van Rossum wrote: PS. Is the collision-generator used in the attack code open source? As I posted before, Alexander Klink and Julian Wälde gave their project email as hash...@alech.de. Since they indicated disappointment in not hearing from Python, I presume they would welcome engagement. -- Terry Jan Reedy ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
On 1/1/2012 12:28 PM, Christian Heimes wrote: Am 01.01.2012 17:54, schrieb Antoine Pitrou: I don't understand. FNV-1 multiplies the current running result with a prime and then xors it with the following byte. This is also what we do. (I'm assuming 103 is prime) There must be a major difference somewhere inside the algorithm. The talk at the CCC conference in Berlin mentions that Ruby 1.9 is not vulnerable to meet-in-the-middle attacks and Ruby 1.9 uses FNV. The C code of FNV is more complex than our code, too. I understood Alexander Klink and Julian Wälde, hash...@alech.de, as saying that they consider that using a random non-zero start value is sufficient to make the hash non-vulnerable. -- Terry Jan Reedy ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
[Python-Dev] Hash collision security issue (now public)
Steven D'Aprano (in http://mail.python.org/pipermail/python-dev/2011-December/115162.html) wrote: By compile-time, do you mean when the byte-code is compilated, i.e. just before runtime, rather than a switch when compiling the Python executable from source? No. I really mean when the C code is initially compiled to produce an python executable. The only reason we're worrying about this is that an adversary may force worst-case performance. If the python instance isn't a server, or at least isn't exposed to untrusted clients, then even a single extra if test is unjustified overhead. Adding overhead to every string hash or every dict lookup is bad. That said, adding some overhead (only) to dict lookups *that already hit half a dozen consecutive collisions* probably is reasonable, because that won't happen very often with normal data. (6 collisions can't happen at all unless there are already at least 6 entries, so small dicts are safe; with at least 1/3 of the slots empty, it should happen only 1/729 for worst-size larger dicts.) -jJ ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
But how about precomputing the intermediate value (x)? The hash is (mostly) doing x = f(x, c) for each c in the input. That's a fair point. If we go down that avenue, I think simply choosing a random fixed starting value for x is the correct choice, rather than computing an intermediate value. I sort of see your point, but I still think that if we could add as little per-character overhead as possible it would be best -- sometimes people *do* hash very long strings. Yep, agreed. My original proposal did not adequately address this. Another option to consider would be to apply this change to some but not all of the rounds. If we added the seed lookup xor operation for only the first and last 5 values of x, we would still retain much of the benefit without adding much computational overhead for very long strings. I like that. I believe this is a reasonable solution. An attacker could still manipulate the internal state of long strings, but the additional information at both ends should make that difficult to exploit. I'll talk it over with the reviewers. We could also consider a less computationally expensive operation than the modulo for calculating the lookup index, like simply truncating to the correct number of bits. Sure. Thanks for thinking about all the details here!! Again, I'll talk to the reviewers (and run the randomness test battery) to be double-check that this doesn't affect the distribution in some unexpected way, but I think it should be fine. PS. Is the collision-generator used in the attack code open source? Not in working form, and they've turned down requests for it from other projects that want to check their work. If it's imperative that we have one, I can write one, but I'd rather not spend the effort if we don't need it. -Paul ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
Different concern. What if someone were to have code implementing an external, persistent hash table, using Python's hash() function? They might have a way to rehash everything when a new version of Python comes along, but they would not be happy if hash() is different in each process. I somehow vaguely remember possibly having seen such code, or something else where a bit of random data was needed and hash() was used since it's so easily available. I agree that there are use cases for allowing users to choose the random seed, in much the same way it's helpful to be able to set it for the random number generator. This should probably be something that can be passed in at runtime. This feature would also be useful for users who want to synchronize the hashes of multiple independent processes, for whatever reason. For the general case though, randomization should be on by default. -Paul ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
[Python-Dev] Hash collision security issue (now public)
Paul McMillan in http://mail.python.org/pipermail/python-dev/2012-January/115183.html wrote: Guido van Rossum wrote: Hm. I'm not sure I like the idea of extra arithmetic for every character being hashed. the collision generator doesn't necessarily vary the length of the string. Additionally, if we don't vary based on all the letters in the string, an attacker can fix the characters that we do use and generate colliding strings around them. If the new hash algorithm doesn't kick in before, say, 32 characters, then most currently hashed strings will not be affected. And if the attacker has to add 32 characters to every key, it reduces the this can be done with only N bytes uploaded risk. (The same logic would apply to even longer prefixes, except that an attacker might more easily find short-enough strings that collide.) We could also consider a less computationally expensive operation than the modulo for calculating the lookup index, like simply truncating to the correct number of bits. Given that the modulo is always 2^N, how is that different? -jJ ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
[Python-Dev] Hash collision security issue (now public)
Victor Stinner wrote in http://mail.python.org/pipermail/python-dev/2012-January/115198.html If we want to protect a website against this attack for example, we must suppose that the attacker can inject arbitrary data and can get (indirectly) the result of hash(str) (e.g. with the representation of a dict in a traceback, with a JSON output, etc.). (1) Is it common to hash non-string input? Because generating integers that collide for certain dict sizes is pretty easy... (2) Would it make sense for traceback printing to sort dict keys? (Any site worried about this issue should already be hiding tracebacks from untrusted clients, but the cost of this extra protection may be pretty small, given that tracebacks shouldn't be printed all that often in the first place.) (3) Should the docs for json.encoder.JSONEncoder suggest sort_keys=True? -jJ ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
[Python-Dev] http://mail.python.org/pipermail/python-dev/2011-December/115172.html
In http://mail.python.org/pipermail/python-dev/2011-December/115172.html, P. J. Eby wrote: On Sat, Dec 31, 2011 at 7:03 AM, Stephen J. Turnbull stephen at xemacs.org wrote: While the dictionary probe has to start with a hash for backward compatibility reasons, is there a reason the overflow strategy for insertion has to be buckets containing lists? How about double-hashing, etc? This won't help, because the keys still have the same hash value. ANYTHING you do to them after they're generated will result in them still colliding. The *only* thing that works is to change the hash function in such a way that the strings end up with different hashes in the first place. Otherwise, you'll still end up with (deliberate) collisions. Well, there is nothing wrong with switching to a different hash function after N collisions, rather than in the first place. The perturbation effectively does by shoving the high-order bits through the part of the hash that survives the mask. (Well, technically, you could use trees or some other O log n data structure as a fallback once you have too many collisions, for some value of too many. Seems a bit wasteful for the purpose, though.) Your WSGI specification http://www.python.org/dev/peps/pep-0333/ requires using a real dictionary for compatibility; storing some of the values outside the values array would violate that. Do you consider that obsolete? -jJ ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] http://mail.python.org/pipermail/python-dev/2011-December/115172.html
Am 02.01.2012 01:37, schrieb Jim Jewett: Well, there is nothing wrong with switching to a different hash function after N collisions, rather than in the first place. The perturbation effectively does by shoving the high-order bits through the part of the hash that survives the mask. Except that it won't work or slow down every lookup of missing keys? It's absolutely crucial that the lookup time is kept as fast as possible. You can't just change the hash algorithm in the middle of the work without a speed impact on lookups. The size of the dict can shrink or grow over time. This results into a different number of collisions for the same string. Cuckoo hashing (http://en.wikipedia.org/wiki/Hash_table#Collision_resolution) doesn't sound feasible for us because it slows down lookup and requires an ABI incompatible change for more hash slots on str/bytes/unicode instances. Christian PS: Something is wrong with your email client. Every of your replies starts a new thread for me. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] http://mail.python.org/pipermail/python-dev/2011-December/115172.html
On Mon, 02 Jan 2012 02:04:38 +0100 Christian Heimes li...@cheimes.de wrote: PS: Something is wrong with your email client. Every of your replies starts a new thread for me. Same here. Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] http://mail.python.org/pipermail/python-dev/2011-December/115172.html
On Sun, Jan 1, 2012 at 8:04 PM, Christian Heimes li...@cheimes.de wrote: Am 02.01.2012 01:37, schrieb Jim Jewett: Well, there is nothing wrong with switching to a different hash function after N collisions, rather than in the first place. The perturbation effectively does by shoving the high-order bits through the part of the hash that survives the mask. Except that it won't work or slow down every lookup of missing keys? It's absolutely crucial that the lookup time is kept as fast as possible. It will only slow down missing keys that themselves hit more than N collisions. Or were you assuming that I meant to switch the whole table, rather than just that one key? I agree that wouldn't work. You can't just change the hash algorithm in the middle of the work without a speed impact on lookups. Right -- but there is nothing wrong with modifying the lookdict (and insert_clean) functions to do something different after the Nth collision than they did after the N-1th. -jJ ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] http://mail.python.org/pipermail/python-dev/2011-December/115172.html
On Sun, Jan 1, 2012 at 7:37 PM, Jim Jewett jimjjew...@gmail.com wrote: Well, there is nothing wrong with switching to a different hash function after N collisions, rather than in the first place. The perturbation effectively does by shoving the high-order bits through the part of the hash that survives the mask. Since these are true hash collisions, they will all have the same high order bits. So, the usefulness of the perturbation is limited mainly to the common case where true collisions are rare. (Well, technically, you could use trees or some other O log n data structure as a fallback once you have too many collisions, for some value of too many. Seems a bit wasteful for the purpose, though.) Your WSGI specification http://www.python.org/dev/peps/pep-0333/ requires using a real dictionary for compatibility; storing some of the values outside the values array would violate that. When I said use some other data structure, I was referring to the internal implementation of the dict type, not to user code. The only user-visible difference (even at C API level) would be the order of keys() et al. (In any case, I still assume this is too costly an implementation change compared to changing the hash function or seeding it. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] http://mail.python.org/pipermail/python-dev/2011-December/115172.html
On Sun, Jan 1, 2012 at 10:00 PM, PJ Eby p...@telecommunity.com wrote: On Sun, Jan 1, 2012 at 7:37 PM, Jim Jewett jimjjew...@gmail.com wrote: Well, there is nothing wrong with switching to a different hash function after N collisions, rather than in the first place. The perturbation effectively does by shoving the high-order bits through the part of the hash that survives the mask. Since these are true hash collisions, they will all have the same high order bits. So, the usefulness of the perturbation is limited mainly to the common case where true collisions are rare. That is only because the perturb is based solely on the hash. Switching to an entirely new hash after the 5th collision (for a given lookup) would resolve that (after the 5th collision); the question is whether or not the cost is worthwhile. (Well, technically, you could use trees or some other O log n data structure as a fallback once you have too many collisions, for some value of too many. Seems a bit wasteful for the purpose, though.) Your WSGI specification http://www.python.org/dev/peps/pep-0333/ requires using a real dictionary for compatibility; storing some of the values outside the values array would violate that. When I said use some other data structure, I was referring to the internal implementation of the dict type, not to user code. The only user-visible difference (even at C API level) would be the order of keys() et al. Given the wording requiring a real dictionary, I would have assumed that it was OK (if perhaps not sensible) to do pointer arithmetic and access the keys/values/hashes directly. (Though if the breakage was between python versions, I would feel guilty about griping too loudly.) -jJ ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
[Python-Dev] PEP 7 clarification request: braces
I've been having an occasional argument with Benjamin regarding braces in 4-line if statements: if (cond) statement; else statement; vs. if (cond) { statement; } else { statement; } He keeps leaving them out, I occasionally tell him they should always be included (most recently this came up when we gave conflicting advice to a patch contributor). He says what he's doing is OK, because he doesn't consider the example in PEP 7 as explicitly disallowing it, I think it's a recipe for future maintenance hassles when someone adds a second statement to one of the clauses but doesn't add the braces. (The only time I consider it reasonable to leave out the braces is for one liner if statements, where there's no else clause at all) Since Benjamin doesn't accept the current brace example in PEP 7 as normative for the case above, I'm bringing it up here to seek clarification. Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 7 clarification request: braces
Nick Coghlan ncogh...@gmail.com writes: He keeps leaving [braces] out [when the block is a single statement], I occasionally tell him they should always be included (most recently this came up when we gave conflicting advice to a patch contributor). As someone who has maintained his fair share of C code, I am firmly on the side of unconditionally (!) enclosing C statement blocks in braces regardless of how many statements they contain. He says what he's doing is OK, because he doesn't consider the example in PEP 7 as explicitly disallowing it I wonder if he has a stronger argument in favour of his position, because “it's not forbidden” doesn't imply “it's okay”. I think it's a recipe for future maintenance hassles when someone adds a second statement to one of the clauses but doesn't add the braces. Agreed, it's an issue of code maintainability. Which is enough of a problem in C code that a low-cost improvement like this should always be done. But, as someone who carries no water in the Python developer community, my opinion has no more force than the arguments, and I can't impose it on anyone. Take it for what it's worth. -- \ “God was invented to explain mystery. God is always invented to | `\ explain those things that you do not understand.” —Richard P. | _o__)Feynman, 1988 | Ben Finney ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 7 clarification request: braces
On 1/1/2012 11:44 PM, Nick Coghlan wrote: I think it's a recipe for future maintenance hassles when someone adds a second statement to one of the clauses but doesn't add the braces. (The only time I consider it reasonable to leave out the braces is for one liner if statements, where there's no else clause at all) Could you explain how these two cases differ with regard to maintenance? In either case, there are superfluous edits required if the original author had used braces *always*. Putting a brace on one-liners adds only a single line to the code -- just like in the if/else case. So, your argument seems conflicted. Surely, you would think this is a simpler edit to make and diff to see in a patch file: if(cond) { stmt1; + stmt2; } vs. -if(cond) +if(cond) { stmt1; + stmt2; +} Also, the superfluous edits will wrongly attribute the blame for the conditional to the wrong author. -- Scott Dial sc...@scottdial.com ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
I fixed a couple things in my proposed algorithm: https://gist.github.com/0a91e52efa74f61858b5 I had a typo, and used 21 instead of 12 for the size multiplier. We definitely don't need 2MB random data. The initialization of r was broken. Now it is an array of ints, so there's no conversion when it's used. I've adjusted it so there's 8k of random data, broken into 2048 ints. I added a length-based seed to the initial value of x. This prevents single-characters from being used to enumerate raw values from r. This is similar to the change proposed by Christian Heimes. Most importantly, I moved the xor with r[x % len_r] down a line. Before, it wasn't being applied to the last character. Christian Heimes said: We also need to special case short strings. The above routine hands over the seed to attackers, if he is able to retrieve lots of single character hashes. The updated code always includes at least 2 lookups from r, which I believe solves the single-character enumeration problem. If we special-case part of our hash function for short strings, we may get suboptimal collisions between the two types of hashes. I think Ruby uses FNV-1 with a salt, making it less vulnerable to this. FNV is otherwise similar to our existing hash function. For the record, cryptographically strong hash functions are in the neighborhood of 400% slower than our existing hash function. Terry Reedy said: I understood Alexander Klink and Julian Wälde, hash...@alech.de, as saying that they consider that using a random non-zero start value is sufficient to make the hash non-vulnerable. I've been talking to them. They're happy to look at our proposed changes. They indicate that a non-zero start value is sufficient to prevent the attack, but D. J. Bernstein disagrees with them. He also has indicated a willingness to look at our solution. -Paul ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 7 clarification request: braces
On Mon, Jan 2, 2012 at 3:04 PM, Scott Dial scott+python-...@scottdial.com wrote: On 1/1/2012 11:44 PM, Nick Coghlan wrote: I think it's a recipe for future maintenance hassles when someone adds a second statement to one of the clauses but doesn't add the braces. (The only time I consider it reasonable to leave out the braces is for one liner if statements, where there's no else clause at all) Could you explain how these two cases differ with regard to maintenance? Sure: always including KR style braces for compound statements (even when they aren't technically necessary) means that indentation == control flow, just like Python. Indent your additions correctly, and the reader and compiler will agree on what they mean: if (cond) { statement; } else { statement; addition; /* Reader and compiler agree this is part of the else clause */ } if (cond) statement; else statement; addition; /* Uh-oh, should have added braces */ I've been trying to convince Benjamin that there's a reason always include the braces is accepted wisdom amongst many veteran C programmers (with some allowing an exception for one-liners), but he isn't believing me, and I'm not going to go through and edit every single one of his commits to add them. Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
[Python-Dev] That depends on what the meaning of is is (was Re: http://mail.python.org/pipermail/python-dev/2011-December/115172.html)
On Sun, Jan 1, 2012 at 10:28 PM, Jim Jewett jimjjew...@gmail.com wrote: Given the wording requiring a real dictionary, I would have assumed that it was OK (if perhaps not sensible) to do pointer arithmetic and access the keys/values/hashes directly. (Though if the breakage was between python versions, I would feel guilty about griping too loudly.) If you're going to be a language lawyer about it, I would simply point out that all the spec requires is that type(env) is dict -- it says nothing about how Python defines type or is or dict. So, you're on your own with that one. ;-) ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 7 clarification request: braces
On Mon, 2012-01-02 at 14:44 +1000, Nick Coghlan wrote: I've been having an occasional argument with Benjamin regarding braces in 4-line if statements: if (cond) statement; else statement; vs. if (cond) { statement; } else { statement; } He keeps leaving them out, I occasionally tell him they should always be included (most recently this came up when we gave conflicting advice to a patch contributor). He says what he's doing is OK, because he doesn't consider the example in PEP 7 as explicitly disallowing it, I think it's a recipe for future maintenance hassles when someone adds a second statement to one of the clauses but doesn't add the braces. I've had to correct my self on this one a few times so I will have to agree it's a good practice. (The only time I consider it reasonable to leave out the braces is for one liner if statements, where there's no else clause at all) The problem is only when an additional statement is added to the last block, not the preceding ones, as the compiler will complain about those. So I don't know how the 4 line example without braces is any worse than a 2 line if without braces. I think my preference is, if any block in a multi-block expression needs braces, then the other blocks should have it. (Including the last block even if it's a single line.) The next level up would be to require them on all blocks, even two line if expressions, but I'm not sure that is really needed. At some point the extra noise of the braces makes things harder to read rather than easier, and what you gain in preventing one type of error may increase chances of another type of error not being noticed. Cheers, Ron Since Benjamin doesn't accept the current brace example in PEP 7 as normative for the case above, I'm bringing it up here to seek clarification. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com