On 06/27/2011 06:30 PM, Sampo Syreeni wrote:
> On 2011-06-20, Marsh Ray wrote:
>> I once looked up the Unicode algorithm for some basic "case
>> insensitive" string comparison... 40 pages!
> Isn't that precisely why e.g. Peter Gutmann once wrote against the
> canonicalization (in the Unicode context, "normalization") that ISO
> derived crypto protocols do, in favour of the "bytes are bytes" approach
> that PGP/GPG takes?
Yes, but in most actual systems the strings are going to get handled.
It's more a question of whether or not your protocol specification
defines the format it's expecting.
Humans tend not to define text very precisely, and computers don't work
with it directly anyway; they only work with encoded representations of
text as character data. Even a simple accented character in a word or
name can be represented in several different ways.
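For example (a quick Python sketch, not tied to any particular protocol):
"é" can arrive precomposed as U+00E9, or decomposed as "e" followed by
U+0301 COMBINING ACUTE ACCENT. The two render identically but are
different code point sequences and different UTF-8 octets until you
normalize:

    import unicodedata

    precomposed = "\u00e9"    # é as one code point (NFC form)
    decomposed = "e\u0301"    # e followed by a combining acute accent (NFD form)

    print(precomposed == decomposed)    # False: different code point sequences
    print(precomposed.encode("utf-8"))  # b'\xc3\xa9'
    print(decomposed.encode("utf-8"))   # b'e\xcc\x81'

    # After normalizing both to the same form, they compare equal.
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True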
Many devs (particularly Unixers :-) in the US, AU, and NZ have gotten
away with the "7-bit ASCII" assumption for a long time, but most of the
rest of the world has to deal with locales, code pages, and multi-byte
encodings. That seems to be how older IETF protocol specs often got
away without a rigorous treatment of character data encoding issues.
(I suspect one factor in the English-speaking world's lead in developing
20th century computers and protocols is that we could get by with one of
the smallest character sets.)
Let's say you're writing a piece of code like:
if (username == "root")
{
    // avoid doing something insecure with root privs
}
The logic of this example is probably broken in important ways but the
point remains: sometimes we need to compare usernames for equality in
contexts that have security implications. You can only claim "bytes are
bytes" up until the point that the customer says they have a directory
server which compares usernames "case insensitively".
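Just to give a flavor of what "case insensitively" can mean once you
leave ASCII, here's a Python sketch. The dir_equal() rule below is my
guess at one plausible matching rule; a real directory server may do
something subtly different:

    import unicodedata

    def dir_equal(a, b):
        # One plausible reading of "case insensitive": compatibility-normalize,
        # then apply Unicode case folding.
        fold = lambda s: unicodedata.normalize("NFKC", s).casefold()
        return fold(a) == fold(b)

    print("root" == "Root")           # False: plain codepoint/byte comparison
    print(dir_equal("root", "Root"))  # True
    # Fullwidth "ｒｏｏｔ" (U+FF52 U+FF4F U+FF4F U+FF54) also folds to "root".
    print(dir_equal("root", "\uff52\uff4f\uff4f\uff54"))  # True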
For most things "verbatim binary" is the right choice. However, a
password or pass phrase is specifically character data which is the
result of a user input method.
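Concretely (a sketch; the choice of SHA-256 and UTF-8 here is purely for
illustration): hash the "same" pass phrase as delivered by two different
input methods and the digests differ, so doing crypto directly on the
raw bytes quietly ties authentication to the user's platform.

    import hashlib, unicodedata

    typed_on_host_a = "caf\u00e9"    # one input method delivers precomposed é
    typed_on_host_b = "cafe\u0301"   # another delivers e + combining accent

    digest = lambda s: hashlib.sha256(s.encode("utf-8")).hexdigest()
    print(digest(typed_on_host_a) == digest(typed_on_host_b))  # False

    # Normalizing (here to NFC) before hashing restores agreement.
    nfc = lambda s: unicodedata.normalize("NFC", s)
    print(digest(nfc(typed_on_host_a)) == digest(nfc(typed_on_host_b)))  # True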
> If you want to do crypto, just do crypto on the bits/bytes. If you
> really have to, you can tag the intended format for forensic purposes
> and sign your intent. But don't meddle with your given bits.
> Canonicalization/normalization is simply too hard to do right or even to
> analyse to have much place in protocol design.
Consider RADIUS.
The first RFC (http://tools.ietf.org/html/rfc2058#section-5.2)
says nothing about the encoding of the character data in the password
field; it just treats it as a series of octets. So what do you do when
implementing RADIUS on an OS that hands user input to your application
encoded as UTF-16LE? If you "don't meddle with your given bits" and
just pass them on to the protocol layer, the result is almost guaranteed
to be non-interoperable.
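In practice an implementation ends up doing something like the following
sketch; the choice of UTF-8 (and of NFC) is my assumption here, nothing
in RFC 2058 requires it:

    import unicodedata

    def password_octets_for_radius(raw_utf16le: bytes) -> bytes:
        # The OS hands us UTF-16LE; the wire wants *some* octet sequence.
        # Decoding, normalizing and re-encoding is already "meddling with
        # the bits", but passing UTF-16LE octets straight through would not
        # interoperate with a peer that chose UTF-8 or a local code page.
        text = raw_utf16le.decode("utf-16-le")
        return unicodedata.normalize("NFC", text).encode("utf-8")

    print(password_octets_for_radius("pässword".encode("utf-16-le")))
    # b'p\xc3\xa4ssword'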
Later RFCs (http://tools.ietf.org/html/rfc2865)
added, in most places, "It is recommended that the message contain
UTF-8 encoded 10646 characters." I think this is a really practical
middle ground. Interestingly, it doesn't say this for the password
field, likely because the authors figured it would break some existing
underspecified behavior.
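The password hiding mechanism itself (RFC 2865 section 5.2, essentially
unchanged from RFC 2058) underlines the point: it operates on opaque
octets and never says what encoding produced them. Roughly, as a sketch
of that construction:

    import hashlib

    def hide_user_password(password_octets, secret, request_authenticator):
        # RFC 2865 section 5.2: pad with NULs to a multiple of 16 octets, then
        #   c(1) = p(1) XOR MD5(secret + Request Authenticator)
        #   c(i) = p(i) XOR MD5(secret + c(i-1))
        # Nothing here knows or cares what character encoding the client used.
        padded = password_octets + b"\x00" * (-len(password_octets) % 16)
        out, prev = b"", request_authenticator
        for i in range(0, len(padded), 16):
            pad = hashlib.md5(secret + prev).digest()
            block = bytes(x ^ y for x, y in zip(padded[i:i + 16], pad))
            out += block
            prev = block
        return out

Whatever octets the client put in are exactly what the server recovers,
so the two ends have to agree out of band on what those octets meant.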
So exactly which characters are allowed in passwords and how are they to
be represented for interoperable RADIUS implementations? I have no idea,
and I help maintain one!
Consequently, we can hardly blame users for not using special characters
in their passwords.
- Marsh