Re: [Python-Dev] Counting collisions for the win
We may use a different salt per dictionary. If we're willing to re-hash everything on a per-dictionary basis. That doesn't seem reasonable given our existing usage. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Counting collisions for the win
On Sat, Jan 21, 2012 at 4:19 PM, Jared Grubb jared.gr...@gmail.com wrote: I agree; it sounds really odd to throw an exception since nothing is actually wrong and there's nothing the caller would do about it to recover anyway. Rather than throwing an exception, maybe you just reseed the random value for the hash This is nonsense. You have to determine the random seed at startup, and it has to be uniform for the entire life of the process. You can't change it after Python has started. -Paul ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Counting collisions for the win
I may have a terminology problem here. I expect that a random seed must change every time it is used, otherwise the pseudorandom number generator using it just returns the same value each time. Should we be talking about a salt rather than a seed? You should read the several other threads, the bug, as well as the implementation and patch under discussion. Briefly, Python string hashes are calculated once per string, and then used in many places. You can't change the hash value for a string during program execution without breaking everything. The proposed change modifies the starting value of the hash function to include a process-wide randomly generated seed. This seed is chosen randomly at runtime, but cannot change once chosen. Using the seed changes the final output of the hash to be unpredictable to an attacker, solving the underlying problem. Salt could also be an appropriate term here, but since salt is generally changed on a per-use basis (a single process may use many different salts), seed is more correct, since this value is only chosen once per process. -Paul ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Status of the fix for the hash collision vulnerability
As I understand it, the way the attack works is that a *single* malicious request from the attacker can DoS the server by eating CPU resources while evaluating a massive collision chain induced in a dict by attacker supplied data. Explicitly truncating the collision chain boots them out almost immediately (likely with a 500 response for an internal server error), so they no longer affect other events, threads and processes on the same machine. This is only true in the specific attack presented at 28c3. If an attacker can insert data without triggering the attack, it's possible to produce (in the example of a web application) urls that (regardless of the request) always produce pathological behavior. For example, a collection of pathological usernames might make it impossible to list users (and so choose which ones to delete) without resorting to removing the problem data at an SQL level. This is why the simply throw an error solution isn't a complete fix. Making portions of an interface unusable for regular users is clearly a bad thing, and is clearly applicable to other types of poisoned data as well. We need to detect collisions and work around them transparently. However, such an app would have been crippled by the original DoS anyway, since its performance would have been gutted - the collision chain limiting just means it will trigger exceptions for the cases that would been insanely slow. We can do better than saying it would have been broken before, it's broken differently now. The universal hash function idea has merit, and for practical purposes hash randomization would fix this too (since colliding data is only likely to collide within a single process, persistent poisoning is far less feasible). -Paul ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
But how about precomputing the intermediate value (x)? The hash is (mostly) doing x = f(x, c) for each c in the input. That's a fair point. If we go down that avenue, I think simply choosing a random fixed starting value for x is the correct choice, rather than computing an intermediate value. I sort of see your point, but I still think that if we could add as little per-character overhead as possible it would be best -- sometimes people *do* hash very long strings. Yep, agreed. My original proposal did not adequately address this. Another option to consider would be to apply this change to some but not all of the rounds. If we added the seed lookup xor operation for only the first and last 5 values of x, we would still retain much of the benefit without adding much computational overhead for very long strings. I like that. I believe this is a reasonable solution. An attacker could still manipulate the internal state of long strings, but the additional information at both ends should make that difficult to exploit. I'll talk it over with the reviewers. We could also consider a less computationally expensive operation than the modulo for calculating the lookup index, like simply truncating to the correct number of bits. Sure. Thanks for thinking about all the details here!! Again, I'll talk to the reviewers (and run the randomness test battery) to be double-check that this doesn't affect the distribution in some unexpected way, but I think it should be fine. PS. Is the collision-generator used in the attack code open source? Not in working form, and they've turned down requests for it from other projects that want to check their work. If it's imperative that we have one, I can write one, but I'd rather not spend the effort if we don't need it. -Paul ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
Different concern. What if someone were to have code implementing an external, persistent hash table, using Python's hash() function? They might have a way to rehash everything when a new version of Python comes along, but they would not be happy if hash() is different in each process. I somehow vaguely remember possibly having seen such code, or something else where a bit of random data was needed and hash() was used since it's so easily available. I agree that there are use cases for allowing users to choose the random seed, in much the same way it's helpful to be able to set it for the random number generator. This should probably be something that can be passed in at runtime. This feature would also be useful for users who want to synchronize the hashes of multiple independent processes, for whatever reason. For the general case though, randomization should be on by default. -Paul ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
I fixed a couple things in my proposed algorithm: https://gist.github.com/0a91e52efa74f61858b5 I had a typo, and used 21 instead of 12 for the size multiplier. We definitely don't need 2MB random data. The initialization of r was broken. Now it is an array of ints, so there's no conversion when it's used. I've adjusted it so there's 8k of random data, broken into 2048 ints. I added a length-based seed to the initial value of x. This prevents single-characters from being used to enumerate raw values from r. This is similar to the change proposed by Christian Heimes. Most importantly, I moved the xor with r[x % len_r] down a line. Before, it wasn't being applied to the last character. Christian Heimes said: We also need to special case short strings. The above routine hands over the seed to attackers, if he is able to retrieve lots of single character hashes. The updated code always includes at least 2 lookups from r, which I believe solves the single-character enumeration problem. If we special-case part of our hash function for short strings, we may get suboptimal collisions between the two types of hashes. I think Ruby uses FNV-1 with a salt, making it less vulnerable to this. FNV is otherwise similar to our existing hash function. For the record, cryptographically strong hash functions are in the neighborhood of 400% slower than our existing hash function. Terry Reedy said: I understood Alexander Klink and Julian Wälde, hash...@alech.de, as saying that they consider that using a random non-zero start value is sufficient to make the hash non-vulnerable. I've been talking to them. They're happy to look at our proposed changes. They indicate that a non-zero start value is sufficient to prevent the attack, but D. J. Bernstein disagrees with them. He also has indicated a willingness to look at our solution. -Paul ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
I'm not too concerned about a 3rd party being able to guess the random seed -- this would require much more effort on their part, since they would have to generate a new set of colliding keys each time they think they have guessed the hash This is incorrect. Once an attacker has guessed the random seed, any operation which reveals the ordering of hashed objects can be used to verify the answer. JSON responses would be ideal. In fact, an attacker can do a brute-force attack of the random seed offline. Once they have the seed, generating collisions is a fast process. The goal isn't perfection, but we need to do better than a simple salt. I propose we modify the string hash function like this: https://gist.github.com/0a91e52efa74f61858b5 This code is based on PyPy's implementation, but the concept is universal. Rather than choosing a single short random seed per process, we generate a much larger random seed (r). As we hash, we deterministically choose a portion of that seed and incorporate it into the hash process. This modification is a minimally intrusive change to the existing hash function, and so should not introduce unexpected side effects which might come from switching to a different class of hash functions. I've worked through this code with Alex Gaynor, Antoine Pitrou, and Victor Stinner, and have asked several mathematicians and security experts to review the concept. The reviewers who have gotten back to me thus far have agreed that if the initial random seed is not flawed, this should not overly change the properties of the hash function, but should make it quite difficult for an attacker to deduce the necessary information to predictably cause hash collisions. This function is not designed to protect against timing attacks, but should be nontrivial to reverse even with access to timing data. Empirical testing shows that this unoptimized python implementation produces ~10% slowdown in the hashing of ~20 character strings. This is probably an acceptable trade off, and actually provides better performance in the case of short strings than a high-entropy fixed-length seed prefix. -Paul ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
Still, it would represent an effort for the attacker of a much greater magnitude than the current attack. It's all a trade-off -- at some point it'll just be easier for the attacker to use some other vulnerability. Also the class of vulnerable servers would be greatly reduced. I agree that doing anything is better than doing nothing. If we use the earlier suggestion and prepend everything with a fixed-length seed, we need quite a bit of entropy (and so a fairly long string) to make a lasting difference. I'm not sure I understand this. What's the worry about a different class of hash functions? (It may be clear that I do not have a deep mathematical understanding of hash functions.) This was mostly in reference to earlier suggestions of switching to cityhash, or using btrees, or other more invasive changes. Python 2.X is pretty stable and making large changes like that to the codebase can have unpredictable effects. We know that the current hash function works well (except for this specific problem), so it seems like the best fix will be as minimal a modification as possible, to avoid introducing bugs. I forget -- what do we do on systems without urandom()? (E.g. Windows?) Windows has CryptGenRandom which is approximately equivalent. Let's worry about timing attacks another time okay? Agreed. As long as there isn't a gaping hole, I'm fine with that. Hm. I'm not sure I like the idea of extra arithmetic for every character being hashed. From a performance standpoint, this may still be better than adding 8 or 10 characters to every single hash operation, since most hashes are over short strings. It is important that this function touches every character - if it only interacts with a subset of them, an attacker can fix that subset and vary the rest. But I like the idea of a bigger random seed from which we deterministically pick some part. Yeah. This makes it much harder to attack, since it very solidly places the attacker outside the realm of just brute force the key. How about just initializing x to some subsequence of the seed determined by e.g. the length of the hashed string plus a few characters from it? We did consider this, and if performance is absolutely the prime directive, this (or a variant) may be the best option. Unfortunately, the collision generator doesn't necessarily vary the length of the string. Additionally, if we don't vary based on all the letters in the string, an attacker can fix the characters that we do use and generate colliding strings around them. Another option to consider would be to apply this change to some but not all of the rounds. If we added the seed lookup xor operation for only the first and last 5 values of x, we would still retain much of the benefit without adding much computational overhead for very long strings. We could also consider a less computationally expensive operation than the modulo for calculating the lookup index, like simply truncating to the correct number of bits. -Paul ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hash collision security issue (now public)
It's worth pointing out that if the salt is somehow exposed to an attacker, or is guessable, much of the benefit goes away. It's likely that a timing attack could be used to discover the salt if it is fixed per machine or process over a long period of time. If a salt is generally fixed per machine, but varies from machine-to-machine, I think we'll see an influx of frustrated devs who have something that works perfectly on their machine but not for others. It doesn't matter that they're doing it wrong, we'll still have to deal with them as a community. This seems like an argument in favor of randomizing it at runtime by default, so it fails early for them. Allowing an environment and command line override makes sense, as it allows users to rotate the salt as frequently as their needs dictate. -Paul ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com