Re: [Python-Dev] Counting collisions for the win

2012-01-22 Thread Paul McMillan
 We may use a different salt per dictionary.

If we're willing to re-hash everything on a per-dictionary basis. That
doesn't seem reasonable given our existing usage.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Counting collisions for the win

2012-01-21 Thread Paul McMillan
On Sat, Jan 21, 2012 at 4:19 PM, Jared Grubb jared.gr...@gmail.com wrote:
 I agree; it sounds really odd to throw an exception since nothing is actually 
 wrong and there's nothing the caller would do about it to recover anyway. 
 Rather than throwing an exception, maybe you just reseed the random value for 
 the hash

This is nonsense. You have to determine the random seed at startup,
and it has to be uniform for the entire life of the process. You can't
change it after Python has started.

-Paul
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Counting collisions for the win

2012-01-21 Thread Paul McMillan
 I may have a terminology problem here. I expect that a random seed must
 change every time it is used, otherwise the pseudorandom number generator
 using it just returns the same value each time. Should we be talking about a
 salt rather than a seed?

You should read the several other threads, the bug, as well as the
implementation and patch under discussion. Briefly, Python string
hashes are calculated once per string, and then used in many places.
You can't change the hash value for a string during program execution
without breaking everything. The proposed change modifies the starting
value of the hash function to include a process-wide randomly
generated seed. This seed is chosen randomly at runtime, but cannot
change once chosen. Using the seed changes the final output of the
hash to be unpredictable to an attacker, solving the underlying
problem.

Salt could also be an appropriate term here, but since salt is
generally changed on a per-use basis (a single process may use many
different salts), seed is more correct, since this value is only
chosen once per process.

-Paul
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-16 Thread Paul McMillan
 As I understand it, the way the attack works is that a *single*
 malicious request from the attacker can DoS the server by eating CPU
 resources while evaluating a massive collision chain induced in a dict
 by attacker supplied data. Explicitly truncating the collision chain
 boots them out almost immediately (likely with a 500 response for an
 internal server error), so they no longer affect other events, threads
 and processes on the same machine.

This is only true in the specific attack presented at 28c3. If an
attacker can insert data without triggering the attack, it's possible
to produce (in the example of a web application) urls that (regardless
of the request) always produce pathological behavior. For example, a
collection of pathological usernames might make it impossible to list
users (and so choose which ones to delete) without resorting to
removing the problem data at an SQL level.

This is why the simply throw an error solution isn't a complete fix.
Making portions of an interface unusable for regular users is clearly
a bad thing, and is clearly applicable to other types of poisoned data
as well. We need to detect collisions and work around them
transparently.

 However, such an app would have been crippled by the original DoS
 anyway, since its performance would have been gutted - the collision
 chain limiting just means it will trigger exceptions for the cases
 that would been insanely slow.

We can do better than saying it would have been broken before, it's
broken differently now. The universal hash function idea has merit,
and for practical purposes hash randomization would fix this too
(since colliding data is only likely to collide within a single
process, persistent poisoning is far less feasible).

-Paul
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Paul McMillan
 But how about precomputing the intermediate value (x)? The hash is (mostly)
 doing x = f(x, c) for each c in the input.

That's a fair point. If we go down that avenue, I think simply
choosing a random fixed starting value for x is the correct choice,
rather than computing an intermediate value.

 I sort of see your point, but I still think that if we could add as little
 per-character overhead as possible it would be best -- sometimes people *do*
 hash very long strings.

Yep, agreed. My original proposal did not adequately address this.

 Another option to consider would be to apply this change to some but
 not all of the rounds. If we added the seed lookup xor operation for
 only the first and last 5 values of x, we would still retain much of
 the benefit without adding much computational overhead for very long
 strings.

 I like that.

I believe this is a reasonable solution. An attacker could still
manipulate the internal state of long strings, but the additional
information at both ends should make that difficult to exploit. I'll
talk it over with the reviewers.

 We could also consider a less computationally expensive operation than
 the modulo for calculating the lookup index, like simply truncating to
 the correct number of bits.

 Sure. Thanks for thinking about all the details here!!

Again, I'll talk to the reviewers (and run the randomness test
battery) to be double-check that this doesn't affect the distribution
in some unexpected way, but I think it should be fine.

 PS. Is the collision-generator used in the attack code open source?

Not in working form, and they've turned down requests for it from
other projects that want to check their work. If it's imperative that
we have one, I can write one, but I'd rather not spend the effort if
we don't need it.

-Paul
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Paul McMillan
 Different concern. What if someone were to have code implementing an
 external, persistent hash table, using Python's hash() function? They might
 have a way to rehash everything when a new version of Python comes along,
 but they would not be happy if hash() is different in each process. I
 somehow vaguely remember possibly having seen such code, or something else
 where a bit of random data was needed and hash() was used since it's so
 easily available.

I agree that there are use cases for allowing users to choose the
random seed, in much the same way it's helpful to be able to set it
for the random number generator. This should probably be something
that can be passed in at runtime. This feature would also be useful
for users who want to synchronize the hashes of multiple independent
processes, for whatever reason. For the general case though,
randomization should be on by default.

-Paul
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2012-01-01 Thread Paul McMillan
I fixed a couple things in my proposed algorithm:
https://gist.github.com/0a91e52efa74f61858b5

I had a typo, and used 21 instead of 12 for the size multiplier. We
definitely don't need 2MB random data.

The initialization of r was broken. Now it is an array of ints, so
there's no conversion when it's used. I've adjusted it so there's 8k
of random data, broken into 2048 ints.

I added a length-based seed to the initial value of x. This prevents
single-characters from being used to enumerate raw values from r.
This is similar to the change proposed by Christian Heimes.

Most importantly, I moved the xor with r[x % len_r] down a line.
Before, it wasn't being applied to the last character.

 Christian Heimes said:
 We also need to special case short strings. The above routine hands over
 the seed to attackers, if he is able to retrieve lots of single
 character hashes.

The updated code always includes at least 2 lookups from r, which I
believe solves the single-character enumeration problem. If we
special-case part of our hash function for short strings, we may get
suboptimal collisions between the two types of hashes.

I think Ruby uses FNV-1 with a salt, making it less vulnerable to
this. FNV is otherwise similar to our existing hash function.

For the record, cryptographically strong hash functions are in the
neighborhood of 400% slower than our existing hash function.

 Terry Reedy said:
 I understood Alexander Klink and Julian Wälde, hash...@alech.de, as saying
 that they consider that using a random non-zero start value is sufficient to
 make the hash non-vulnerable.

I've been talking to them. They're happy to look at our proposed
changes. They indicate that a non-zero start value is sufficient to
prevent the attack, but D. J. Bernstein disagrees with them. He also
has indicated a willingness to look at our solution.

-Paul
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2011-12-31 Thread Paul McMillan
 I'm not too concerned about a 3rd party being able to guess the random seed
 -- this would require much more effort on their part, since they would have
 to generate a new set of colliding keys each time they think they have
 guessed the hash

This is incorrect. Once an attacker has guessed the random seed, any
operation which reveals the ordering of hashed objects can be used to
verify the answer. JSON responses would be ideal. In fact, an attacker
can do a brute-force attack of the random seed offline. Once they have
the seed, generating collisions is a fast process.

The goal isn't perfection, but we need to do better than a simple
salt. I propose we modify the string hash function like this:

https://gist.github.com/0a91e52efa74f61858b5

This code is based on PyPy's implementation, but the concept is
universal. Rather than choosing a single short random seed per
process, we generate a much larger random seed (r). As we hash, we
deterministically choose a portion of that seed and incorporate it
into the hash process. This modification is a minimally intrusive
change to the existing hash function, and so should not introduce
unexpected side effects which might come from switching to a different
class of hash functions.

I've worked through this code with Alex Gaynor, Antoine Pitrou, and
Victor Stinner, and have asked several mathematicians and security
experts to review the concept. The reviewers who have gotten back to
me thus far have agreed that if the initial random seed is not flawed,
this should not overly change the properties of the hash function, but
should make it quite difficult for an attacker to deduce the necessary
information to predictably cause hash collisions. This function is not
designed to protect against timing attacks, but should be nontrivial
to reverse even with access to timing data.

Empirical testing shows that this unoptimized python implementation
produces ~10% slowdown in the hashing of ~20 character strings. This
is probably an acceptable trade off, and actually provides better
performance in the case of short strings than a high-entropy
fixed-length seed prefix.

-Paul
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2011-12-31 Thread Paul McMillan
 Still, it would represent an effort for the attacker of a much greater
 magnitude than the current attack. It's all a trade-off -- at some point
 it'll just be easier for the attacker to use some other vulnerability. Also
 the class of vulnerable servers would be greatly reduced.

I agree that doing anything is better than doing nothing. If we use
the earlier suggestion and prepend everything with a fixed-length
seed, we need quite a bit of entropy (and so a fairly long string) to
make a lasting difference.

 I'm not sure I understand this. What's the worry about a different class of
 hash functions? (It may be clear that I do not have a deep mathematical
 understanding of hash functions.)

This was mostly in reference to earlier suggestions of switching to
cityhash, or using btrees, or other more invasive changes. Python 2.X
is pretty stable and making large changes like that to the codebase
can have unpredictable effects. We know that the current hash function
works well (except for this specific problem), so it seems like the
best fix will be as minimal a modification as possible, to avoid
introducing bugs.

 I forget -- what do we do on systems without urandom()? (E.g. Windows?)
Windows has CryptGenRandom which is approximately equivalent.

 Let's worry about timing attacks another time okay?
Agreed. As long as there isn't a gaping hole, I'm fine with that.

 Hm. I'm not sure I like the idea of extra arithmetic for every character
 being hashed.

From a performance standpoint, this may still be better than adding 8
or 10 characters to every single hash operation, since most hashes are
over short strings. It is important that this function touches every
character - if it only interacts with a subset of them, an attacker
can fix that subset and vary the rest.

 But I like the idea of a bigger random seed from which we
 deterministically pick some part.

Yeah. This makes it much harder to attack, since it very solidly
places the attacker outside the realm of just brute force the key.

 How about just initializing x to some
 subsequence of the seed determined by e.g. the length of the hashed string
 plus a few characters from it?

We did consider this, and if performance is absolutely the prime
directive, this (or a variant) may be the best option. Unfortunately,
the collision generator doesn't necessarily vary the length of the
string. Additionally, if we don't vary based on all the letters in the
string, an attacker can fix the characters that we do use and generate
colliding strings around them.

Another option to consider would be to apply this change to some but
not all of the rounds. If we added the seed lookup xor operation for
only the first and last 5 values of x, we would still retain much of
the benefit without adding much computational overhead for very long
strings.

We could also consider a less computationally expensive operation than
the modulo for calculating the lookup index, like simply truncating to
the correct number of bits.

-Paul
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Hash collision security issue (now public)

2011-12-29 Thread Paul McMillan
It's worth pointing out that if the salt is somehow exposed to an
attacker, or is guessable, much of the benefit goes away. It's likely
that a timing attack could be used to discover the salt if it is fixed
per machine or process over a long period of time.

If a salt is generally fixed per machine, but varies from
machine-to-machine, I think we'll see an influx of frustrated devs who
have something that works perfectly on their machine but not for
others. It doesn't matter that they're doing it wrong, we'll still
have to deal with them as a community. This seems like an argument in
favor of randomizing it at runtime by default, so it fails early for
them.

Allowing an environment and command line override makes sense, as it
allows users to rotate the salt as frequently as their needs dictate.

-Paul
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com