Re: [Openca-Users] Certificate with accents

Rémi Cohen-Scali Mon, 16 Feb 2004 08:11:30 -0800

You can use UTF-8 strings.
FYI here is an extract from an X509 guide about DN encodings ... enjoy ;-)

"Character strings are used in various places (most notably in DNs), and are encumbered by the fact that ASN.1 defines a whole series of odd subsets of ASCII/ISO 646 as character string types, but only provides a few peculiar and strange oddball character encodings for anything outside this limited character range. ... To use the example of DNs, the allowed string types are:

DirectoryString ::= CHOICE {
   teletexString           TeletexString (SIZE (1..maxSize)),
   printableString         PrintableString (SIZE (1..maxSize)),
   bmpString               BMPString (SIZE (1..maxSize)),
   universalString         UniversalString (SIZE (1..maxSize))
   }

The easiest one to use, if you can get away with it, is IA5String, which is basically 7-bit ASCII (including all the control codes), but with the dollar sign potentially replaced with a "currency symbol". A more sensible alternative is VisibleString (aka ISO646String), which is IA5String without the control codes (this has the advantage that you can't use it to construct macro viruses using ANSI control sequences). In the DirectoryString case, you have to make do with PrintableString, which is one of the odd ASCII/ISO 646 subsets (for example you can't encode an '@', which makes it rather challenging to encode email addresses).
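As a rough illustration of how narrow PrintableString is, here is a minimal Python sketch that tests a string against the PrintableString alphabet (letters, digits, space and the punctuation ' ( ) + , - . / : = ?) and against the 7-bit IA5String range. The function names are made up for this example and are not taken from any library.

```python
import string

# PrintableString alphabet: A-Z, a-z, 0-9, space and ' ( ) + , - . / : = ?
PRINTABLE_CHARS = frozenset(string.ascii_letters + string.digits + " '()+,-./:=?")


def is_printable_string(text):
    """Return True if every character is allowed in a PrintableString."""
    return all(ch in PRINTABLE_CHARS for ch in text)


def is_ia5_string(text):
    """Return True if every character fits in the 7-bit IA5String range."""
    return all(ord(ch) < 128 for ch in text)


if __name__ == "__main__":
    for sample in ("Nuno Miguel Neves", "user@example.com", "António"):
        print(sample, is_printable_string(sample), is_ia5_string(sample))
```

An email address fails the first check because of the '@' but passes the second, which is why email addresses usually end up in an IA5String.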

Beyond that there is the T.61/TeletexString, which can select different character sets using escape codes (this is one of the aforementioned "peculiar and strange oddball encodings"). The character sets are Japanese Kanji (JIS C 6226-1983, set No. 87), Chinese (GB 2312-80, set No. 58), and Greek, using shifting codes specified in T.61 or the international version, ISO 6937 (strictly speaking T61String isn't defined in T.61 but in X.680, which defines it by profiling ISO 2022 character set switching). Some of the characters have a variable-length encoding (so it takes 2 bytes to encode a character, with the interpretation being done in a context-specific manner). The problem isn't helped by the fact that the T.61 specification has changed over the years as new character sets were added, and since the T.61 spec has now been withdrawn by the ITU there's no real way to find out exactly what is supposed to be in there (but see the previous comment on T.61 vs T61String - a T61String isn't really a T.61 string). Even using straight 8859-1 in a T61String doesn't always work: for example, the 8859-1 character code for the Norwegian OE (slashed O) is defined using a T.61 escape sequence which, if present in a certificate, may cause a directory to reject the certificate. ...

For those who haven't reached for a sick bag yet, one definition of T61String is given in ISO 1990 X.208 which indicates that it contains registered character sets 87, 102 (a minimalist version of ASCII), 103 (a character set with the infamous "floating diacritics" which means things like accented characters are encoded as "<add an accent to the next character> + <character>" rather than with a single character code), 106 and 107 (two useless sets containing control characters which no one would put in a name), SPACE + DELETE. The newer ITU-T 1997 and ISO 1998 X.680 add the character sets 6, 126, 144, 150, 153, 156, 164, 165, and 168 (the reason for some of these additions is because once a character set is registered it can never change except by "clarifying" it, which produces a completely new character set with a new number (as with sex, once you make a mistake you end up having to support it for the rest of your life)). In fact there are even more definitions of T61String than that: The original CCITT 1984 ASN.1 spec defined T61String by reference to a real T.61 recommendation (from which finding the actual permitted characters is challenging, to put it mildly), then the ISO 1986 spec defined them by reference to the international register, then the CCITT 1988 spec changed them again (the ISO 1990 spec described above may be identical to the CCITT 1988 one), and finally they were changed again for ISO/ITU-T 1994 (this 1994 spec may again be the same as ITU-T 1997 and ISO 1998). I'm not making this up! ...

The encoding for this mess is specified in X.209 which indicates that the default character sets at the start of a string are 102, 106 and 107, although in theory you can't really make this assumption without the appropriate escape sequences to invoke the correct character set. The general consensus among the X.500/ISODE directory crowd is that you assume that set 103 is used by default, although Microsoft and Netscape had other ideas for their LDAPv2 products. In certificates, the common practice seems to be to use straight latin-1, which is set numbers 6 and 100, the latter not even being an allowed T61String set. ...

Next are the BMPString and UniversalString, with BMPString having 16-bit characters (UCS-2) and UniversalString having 32-bit characters (UCS-4), both encoded in big-endian format. BMPString is a subset of UniversalString, being the 16-bit character range in the 0/0 plane (i.e. the UniversalString characters in which the 16 high bits are 0), corresponding to straight ISO 10646/Unicode characters. The ASN.1 standard says that UniversalString should only be used if the encoding possibilities are constrained; it's better to avoid it entirely and only use BMPString/ISO 10646/Unicode.

However, there is a problem with this: at the moment few implementors know how to handle or encode BMPStrings, and people have made all sorts of guesses as to how Unicode strings should be encoded: with or without Unicode byte order marks (BOMs), possibly with a fixed endianness, and with or without the terminating null character. ... The correct format for BMPStrings is: big-endian 16-bit characters, no Unicode byte order marks (BOMs), and no terminating null character (ISO 8825-1 section 8.20).
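The format described above maps directly onto Python's "utf-16-be" codec, which emits big-endian 16-bit units without a byte order mark and without a terminator. The following is only a sketch: encode_bmpstring is a hypothetical helper name, and it produces just the string contents, not the surrounding DER tag and length.

```python
def encode_bmpstring(text):
    """Return BMPString contents: big-endian UCS-2, no BOM, no trailing null."""
    if any(ord(ch) > 0xFFFF for ch in text):
        # Characters beyond plane 0 would need surrogate pairs, which UCS-2 lacks.
        raise ValueError("character outside the Basic Multilingual Plane")
    # The "utf-16-be" codec never writes a byte order mark.
    return text.encode("utf-16-be")


if __name__ == "__main__":
    print(encode_bmpstring("António").hex())
```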

An exception to this is PFX/PKCS #12, where the passwords are converted to a Unicode BMPString before being hashed. However both Netscape and Microsoft's early implementations treated the terminating null characters as being part of the string, so the PKCS #12 standard was retroengineered to specify that the null characters be included in the string.
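To make the difference concrete, this small sketch builds the byte string that would be fed to the hash for a PKCS #12 password, assuming the behaviour described above (big-endian 16-bit characters followed by a two-byte null terminator). The helper name is invented for illustration.

```python
def pkcs12_password_bytes(password):
    """BMPString form of a password with the two-byte null terminator appended."""
    # Unlike a BMPString in a certificate, the terminator is part of the data here.
    return password.encode("utf-16-be") + b"\x00\x00"


if __name__ == "__main__":
    print(pkcs12_password_bytes("ab").hex())
```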

A final string type which is presently only in the PKIX profile but which should eventually appear elsewhere is UTF-8, which provides a means of encoding 7, 8, 16, and 32-bit characters into a single character string. Since ASN.1 already provides character string types which cover everything except some of the really weird 32-bit characters which no one ever uses, ... the least general encoding rule means that UTF-8 strings will practically never be used. The original reason they were present in the PKIX profile is because of an IETF rule which required that all new IETF standards support UTF-8, but a much more compelling argument which recently emerged is that, since most of the other ASN.1 character sets are completely unusable, UTF-8 would finally breathe a bit of sanity into the ASN.1 character set nightmare. Unfortunately, because it's quite a task to find ASN.1 compilers (let alone certificate handling software) which support UTF-8, you should avoid this string type for now. PKIX realised the problems which would arise and specified a cutover date of 1 January 2004 for UTF-8 use. Some drafts have appeared which recommend the use of RFC 2482 language tags, but these should be avoided since they have little value (they're only needed for machine processing, if they appear in a text string intended to be read by a human they'll either understand it or they won't and a language tag won't help). In addition UTF-8 language tags are huge (about 30 bytes) due to the fact that they're located out in plane 14 in the character set (although I don't have the appropriate reference to hand, plane 14 is probably either Gehenna or Acheron), so the tag would be much larger than the string being tagged in most cases.

One final problem with UTF-8 is that it shares some of the T.61 string problems in which it's possible for a malicious encoder to evade checks on strings either by using different code points which produce identical-looking characters when displayed or by using suboptimal encodings (in ASN.1 terms, non-distinguished encodings) of a code point. They are aided in this by the standard, which says (page 47, section 3.8 of the Unicode 3.0 standard) that "when converting from UTF-8 to a Unicode scalar value, implementations do not need to check that the shortest encoding is being used. This simplifies the conversion algorithm". What this means is that it's possible to encode a particular character in a dozen different ways in order to evade a check which uses a straight byte-by-byte comparison as specified in RFC 2459. Although some libraries such as glibc 2.2 use "safe" UTF-8 decoders which will reject non-distinguished encodings, it's not a good idea to assume that everyone does this.
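A short sketch of the non-distinguished encoding problem: the bytes C0 AF are a two-byte, overlong form of '/' (U+002F), so a byte-by-byte comparison against the one-byte form would not match even though a lenient decoder would yield the same character. Python 3's built-in decoder happens to be one of the "safe" ones and rejects the overlong form; the wrapper name below is hypothetical.

```python
def strict_utf8_decode(data):
    """Decode UTF-8, returning None if the bytes are not well-formed."""
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


SHORTEST = b"/"         # U+002F in its one-byte (distinguished) form
OVERLONG = b"\xc0\xaf"  # the same code point padded out to two bytes

if __name__ == "__main__":
    print(strict_utf8_decode(SHORTEST))
    print(strict_utf8_decode(OVERLONG))
```

Rejecting malformed input before any comparison takes place is the simplest defence; it does nothing about distinct code points that merely look alike.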

Because of these problems, the SET designers produced their own alternative, SETString, for places where DNs weren't required for compatibility purposes. The design goals for the SETString were to both provide the best coverage of ASCII and national-language character sets, and also to minimise implementation pain. The SETString type is defined as:

SETString ::= CHOICE {
   visibleString           VisibleString (SIZE (1..maxSIZE)),
   bmpString               BMPString (SIZE (1..maxSIZE))
   }

This provides complete ASCII/ISO 646 support using single byte characters, and national language support through Unicode, which is in common use by industry.

In addition the SET designers decided to create their own version of the DirectoryString which is a proper subset of the X.500 version. The initial version was just an X.500 DirectoryString with a number of constraints applied to it, but just before publication this was changed to:

DirectoryString ::= CHOICE {
   printableString         PrintableString (SIZE(1..maxSIZE)),
   bmpString               BMPString (SIZE(1..maxSIZE))
   }
                   You must unlearn what you have learned.
                       -- Yoda

It was felt that this improved readability and interoperability (and sanity). T61String was never seriously considered in the design, and UniversalString with its four-byte characters had no identifiable industry support and required too much overhead. If you want to produce certs which work for both generic X.509 and SET, then using the SET version of the DirectoryString is a good idea. It's trivial to convert an ISO 8859-1 T61String to a BMPString and back (just add/subtract a 0 byte every other byte).
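The add/subtract-a-zero-byte conversion can be sketched as follows, assuming the T61String really is plain ISO 8859-1 with no escape sequences. Both function names are made up for this example.

```python
def latin1_to_bmp(data):
    """Widen ISO 8859-1 bytes to BMPString contents by prefixing each with 0x00."""
    out = bytearray()
    for byte in data:
        out += bytes((0, byte))
    return bytes(out)


def bmp_to_latin1(data):
    """Narrow BMPString contents back to ISO 8859-1 by dropping the high bytes."""
    if len(data) % 2 or any(data[0::2]):
        # A non-zero high byte means the character has no 8859-1 equivalent.
        raise ValueError("not representable as ISO 8859-1")
    return data[1::2]


if __name__ == "__main__":
    t61 = "António".encode("latin-1")
    bmp = latin1_to_bmp(t61)
    print(bmp.hex(), bmp_to_latin1(bmp) == t61)
```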

MISSI also subsets the string types, allowing only PrintableString and
T61String in DNs.

When dealing with these character sets you should use the "least inclusive" set when trying to determine which encoding to use. This means trying to encode as PrintableString first, then T61String, and finally BMPString/UniversalString. SET requires that either PrintableStrings or BMPStrings be used, with TeletexStrings and UniversalStrings being forbidden.

From this we can build the following set of recommendations:

- Use PrintableString if possible (or VisibleString or IA5String if this is allowed, because it's rather more useful than PrintableString).
- If you use a T61String (and assuming you don't require SET compliance), avoid the use of anything involving shifting and escape codes at any cost and just treat it as a pure ISO 8859-1 string. If you need anything other than 8859-1, use a BMPString.
- If it won't go into one of the above, try for a BMPString.
- Avoid UniversalStrings.
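The "least inclusive" selection in these recommendations can be sketched as a simple cascade. This is an illustration only: choose_string_type and its allow_t61 flag are hypothetical, and T61String is treated as the printable part of ISO 8859-1, as the recommendation above suggests.

```python
import string

# PrintableString alphabet: A-Z, a-z, 0-9, space and ' ( ) + , - . / : = ?
PRINTABLE_CHARS = frozenset(string.ascii_letters + string.digits + " '()+,-./:=?")


def _is_latin1_text(text):
    """True if text uses only printable ISO 8859-1 characters (no control codes)."""
    return all(0x20 <= ord(ch) <= 0x7E or 0xA0 <= ord(ch) <= 0xFF for ch in text)


def choose_string_type(text, allow_t61=True):
    """Pick the least inclusive string type that can hold text."""
    if all(ch in PRINTABLE_CHARS for ch in text):
        return "PrintableString"
    if allow_t61 and _is_latin1_text(text):
        return "T61String"
    if all(ord(ch) <= 0xFFFF for ch in text):
        return "BMPString"
    return "UniversalString"


if __name__ == "__main__":
    for name in ("Nuno Neves", "António", "Ελλάδα"):
        print(name, choose_string_type(name), choose_string_type(name, allow_t61=False))
```

Passing allow_t61=False approximates the SET rule, under which only PrintableString and BMPString are permitted.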

Version 7 of the PKIX draft dropped the use of T61String altogether (probably in response to this writeup :-), but this may be a bit extreme since the extremely limited character range allowed by PrintableString will result in many simple strings blowing out to BMPStrings, which causes problems on a number of systems which have little Unicode support.

In 2004, you can switch to UTF-8 strings and forget about this entire section of the guide."
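For comparison with the older types, here is a minimal sketch of what a name such as "António" looks like as a DER-encoded UTF8String (universal tag 0x0C, then a length byte, then the UTF-8 bytes). The helper is hypothetical and handles only short-form lengths; a real CA would leave this to its ASN.1 library.

```python
UTF8STRING_TAG = 0x0C  # universal tag number 12


def der_utf8string(text):
    """Encode text as a DER UTF8String (short-form lengths only)."""
    content = text.encode("utf-8")
    if len(content) > 127:
        raise ValueError("long-form lengths are left out of this sketch")
    return bytes((UTF8STRING_TAG, len(content))) + content


if __name__ == "__main__":
    print(der_utf8string("António").hex())
```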

Nuno Miguel Neves wrote:

I've issued a certificate whose name had an accent (António).

However, both in Mozilla and Thunderbird, the certificate shows up with an empty name. :(

Is there any way I can fix this, or must I enforce that all names are written without accents?

Thanks,
