You can use UTF-8 strings.
FYI here is an extract from an X.509 guide about DN encodings ... enjoy ;-)

"Character strings are used in various places (most notably in DNs), and are
encumbered by the fact that ASN.1 defines a whole series of odd subsets of
ASCII/ISO 646 as character string types, but only provides a few peculiar and
strange oddball character encodings for anything outside this limited character
range.
...
To use the example of DNs, the allowed string types are:


DirectoryString ::= CHOICE {
   teletexString           TeletexString (SIZE (1..maxSize)),
   printableString         PrintableString (SIZE (1..maxSize)),
   bmpString               BMPString (SIZE (1..maxSize)),
   universalString         UniversalString (SIZE (1..maxSize))
   }


The easiest one to use, if you can get away with it, is IA5String, which is
basically 7-bit ASCII (including all the control codes), but with the dollar
sign potentially replaced with a "currency symbol". A more sensible
alternative is VisibleString (aka ISO646String), which is IA5String without the
control codes (this has the advantage that you can't use it to construct macro
viruses using ANSI control sequences). In the DirectoryString case, you have
to make do with PrintableString, which is one of the odd ASCII/ISO 646 subsets
(for example you can't encode an '@', which makes it rather challenging to
encode email addresses).
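
To get a feel for how restrictive PrintableString is, here is a minimal Python sketch that checks a string against the PrintableString alphabet (letters, digits, space, and the punctuation ' ( ) + , - . / : = ?). The name is_printable_string is made up for this illustration and is not part of any library.

```python
import string

# PrintableString alphabet: letters, digits, space and a small set of
# punctuation marks.  Characters such as '@', '&' and '_' are absent.
PRINTABLE_STRING_CHARS = frozenset(
    string.ascii_letters + string.digits + " '()+,-./:=?"
)


def is_printable_string(text: str) -> bool:
    """Return True if every character is in the PrintableString alphabet.

    Hypothetical helper for illustration only.
    """
    return all(ch in PRINTABLE_STRING_CHARS for ch in text)
```

With this check, "Example CA, Inc." passes, while "user@example.com" and "António" do not.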


Beyond that there is the T.61/TeletexString, which can select different
character sets using escape codes (this is one of the aforementioned "peculiar
and strange oddball encodings"). The character sets are Japanese Kanji (JIS C
6226-1983, set No. 87), Chinese (GB 2312-80, set No. 58), and Greek, using
shifting codes specified in T.61 or the international version, ISO 6937
(strictly speaking T61String isn't defined in T.61 but in X.680, which defines
it by profiling ISO 2022 character set switching). Some of the characters have
a variable-length encoding (so it takes 2 bytes to encode a character, with the
interpretation being done in a context-specific manner). The problem isn't
helped by the fact that the T.61 specification has changed over the years as
new character sets were added, and since the T.61 spec has now been withdrawn
by the ITU there's no real way to find out exactly what is supposed to be in
there (but see the previous comment on T.61 vs T61String - a T61String isn't
really a T.61 string). Even using straight 8859-1 in a T61String doesn't
always work, for example the 8859-1 character code for the Norwegian OE
(slashed O) is defined using a T.61 escape sequence which, if present in a
certificate, may cause a directory to reject the certificate.
...
For those who haven't reached for a sick bag yet, one definition of T61String
is given in ISO 1990 X.208 which indicates that it contains registered
character sets 87, 102 (a minimalist version of ASCII), 103 (a character set
with the infamous "floating diacritics" which means things like accented
characters are encoded as "<add an accent to the next character> + <character>"
rather than with a single character code), 106 and 107 (two useless sets
containing control characters which no one would put in a name), SPACE + DELETE.
The newer ITU-T 1997 and ISO 1998 X.680 adds the character sets 6, 126, 144,
150, 153, 156, 164, 165, and 168 (the reason for some of these additions is
because once a character set is registered it can never change except by
"clarifying" it, which produces a completely new character set with a new
number (as with sex, once you make a mistake you end up having to support it
for the rest of your life)). In fact there are even more definitions of
T61String than that: The original CCITT 1984 ASN.1 spec defined T61String by
reference to a real T.61 recommendation (from which finding the actual
permitted characters is challenging, to put it mildly), then the ISO 1986 spec
defined them by reference to the international register, then the CCITT 1988
spec changed them again (the ISO 1990 spec described above may be identical to
the CCITT 1988 one), and finally they were changed again for ISO/ITU-T 1994
(this 1994 spec may again be the same as ITU-T 1997 and ISO 1998). I'm not
making this up!
...
The encoding for this mess is specified in X.209 which indicates that the
default character sets at the start of a string are 102, 106 and 107, although
in theory you can't really make this assumption without the appropriate escape
sequences to invoke the correct character set. The general consensus among
the X.500/ISODE directory crowd is that you assume that set 103 is used by
default, although Microsoft and Netscape had other ideas for their LDAPv2
products. In certificates, the common practice seems to be to use straight
latin-1, which is set numbers 6 and 100, the latter not even being an allowed
T61String set.
...
Next are the BMPString and UniversalString, with BMPString having 16-bit
characters (UCS-2) and UniversalString having 32-bit characters (UCS-4), both
encoded in big-endian format. BMPString is a subset of UniversalString, being
the 16-bit character range in the 0/0 plane (ie the UniversalString characters
in which the 16 high bits are 0), corresponding to straight ISO 10646/Unicode
characters. The ASN.1 standard says that UniversalString should only be used
if the encoding possibilities are constrained; it's better to avoid it entirely
and only use BMPString/ISO 10646/Unicode.


However, there is a problem with this: at the moment few implementors know how
to handle or encode BMPStrings, and people have made all sorts of guesses as to
how Unicode strings should be encoded: with or without Unicode byte order marks
(BOMs), possibly with a fixed endianness, and with or without the terminating
null character.
...
The correct format for BMPStrings is: big-endian 16-bit characters, no Unicode
byte order marks (BOMs), and no terminating null character (ISO 8825-1 section
8.20).
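
As a rough sketch of the format just described, the following Python function produces the content octets of a BMPString (the ASN.1 tag and length are not included). The function name encode_bmpstring is an assumption made for this example. Python's "utf-16-be" codec writes big-endian 16-bit units without a BOM, and nothing appends a terminating null.

```python
def encode_bmpstring(text: str) -> bytes:
    """Return BMPString content octets: big-endian UCS-2, no BOM, no null.

    Hypothetical helper; rejects anything outside the 16-bit 0/0 plane.
    """
    for ch in text:
        code = ord(ch)
        # Characters above U+FFFF would need surrogate pairs, and lone
        # surrogates are not characters, so neither fits a BMPString.
        if code > 0xFFFF or 0xD800 <= code <= 0xDFFF:
            raise ValueError(f"U+{code:04X} cannot be encoded in a BMPString")
    # "utf-16-be" fixes the byte order and therefore emits no BOM.
    return text.encode("utf-16-be")
```

For example, encode_bmpstring("Aó") yields the four bytes 00 41 00 F3.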


An exception to this is PFX/PKCS #12, where the passwords are converted to a
Unicode BMPString before being hashed. However both Netscape and Microsoft's
early implementations treated the terminating null characters as being part of
the string, so the PKCS #12 standard was retroengineered to specify that the
null characters be included in the string.
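
The PKCS #12 exception can be sketched the same way. This only shows the password formatting step described above (big-endian 16-bit characters followed by a two-byte null terminator), not the hashing or key derivation that follows it. The name pkcs12_password_bytes is hypothetical.

```python
def pkcs12_password_bytes(password: str) -> bytes:
    """Format a password the way PKCS #12 expects before hashing.

    Hypothetical helper: big-endian 16-bit characters, no BOM, with the
    terminating null character (two zero bytes) included in the string.
    """
    return password.encode("utf-16-be") + b"\x00\x00"
```

So the password "ab" becomes 00 61 00 62 00 00, six bytes rather than the four a plain BMPString would contain.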


A final string type which is presently only in the PKIX profile but which
should eventually appear elsewhere is UTF-8, which provides a means of encoding
7, 8, 16, and 32-bit characters into a single character string. Since ASN.1
already provides character string types which cover everything except some of
the really weird 32-bit characters which no one ever uses,
...
the least general encoding rule means that UTF-8 strings will practically never
be used. The original reason they were present in the PKIX profile is because
of an IETF rule which required that all new IETF standards support UTF-8, but a
much more compelling argument which recently emerged is that, since most of the
other ASN.1 character sets are completely unusable, UTF-8 would finally breathe
a bit of sanity into the ASN.1 character set nightmare. Unfortunately, because
it's quite a task to find ASN.1 compilers (let alone certificate handling
software) which support UTF-8, you should avoid this string type for now. PKIX
realised the problems which would arise and specified a cutover date of 1
January 2004 for UTF-8 use. Some drafts have appeared which recommend the use
of RFC 2482 language tags, but these should be avoided since they have little
value (they're only needed for machine processing; if they appear in a text
string intended to be read by a human, the reader will either understand it or
not, and a language tag won't help). In addition UTF-8 language tags are huge
(about 30 bytes) due to the fact that they're located out in plane 14 in the
character set (although I don't have the appropriate reference to hand, plane
14 is probably either Gehenna or Acheron), so the tag would be much larger than
the string being tagged in most cases.


One final problem with UTF-8 is that it shares some of the T.61 string problems
in that it's possible for a malicious encoder to evade checks on strings
either by using different code points which produce identical-looking
characters when displayed or by using suboptimal encodings (in ASN.1 terms,
non-distinguished encodings) of a code point. They are aided in this by the
standard, which says (page 47, section 3.8 of the Unicode 3.0 standard) that
"when converting from UTF-8 to a Unicode scalar value, implementations do not
need to check that the shortest encoding is being used. This simplifies the
conversion algorithm". What this means is that it's possible to encode a
particular character in a dozen different ways in order to evade a check which
uses a straight byte-by-byte comparison as specified in RFC 2459. Although
some libraries such as glibc 2.2 use "safe" UTF-8 decoders which will reject
non-distinguished encodings, it's not a good idea to assume that everyone does
this.
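
The non-shortest-form problem can be illustrated with a small Python sketch. The two-byte sequence C0 AF is an overlong encoding of '/' (U+002F). A decoder that skips the shortest-form check accepts it, while Python 3's built-in UTF-8 decoder rejects it. The helper names naive_decode_2byte and strict_decode are made up for this example, and the naive decoder is deliberately incomplete.

```python
OVERLONG_SLASH = b"\xc0\xaf"  # non-shortest two-byte form of U+002F '/'
SHORTEST_SLASH = b"/"         # the only valid UTF-8 encoding of '/'


def naive_decode_2byte(data: bytes) -> str:
    """Decode one two-byte UTF-8 sequence WITHOUT a shortest-form check.

    Deliberately unsafe, to show what a lenient decoder would do.
    """
    lead, trail = data
    return chr(((lead & 0x1F) << 6) | (trail & 0x3F))


def strict_decode(data: bytes):
    """Decode with Python's built-in decoder; return None if it is rejected."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None
```

Here naive_decode_2byte(OVERLONG_SLASH) produces "/", even though a byte-by-byte comparison of OVERLONG_SLASH against SHORTEST_SLASH reports them as different, while strict_decode(OVERLONG_SLASH) returns None.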


Because of these problems, the SET designers produced their own alternative,
SETString, for places where DNs weren't required for compatibility purposes.
The design goals for the SETString were to both provide the best coverage of
ASCII and national-language character sets, and also to minimise implementation
pain. The SETString type is defined as:


SETString ::= CHOICE {
   visibleString           VisibleString (SIZE (1..maxSIZE)),
   bmpString               BMPString (SIZE (1..maxSIZE))
   }

This provides complete ASCII/ISO 646 support using single byte characters, and
national language support through Unicode, which is in common use by industry.


In addition the SET designers decided to create their own version of the
DirectoryString which is a proper subset of the X.500 version. The initial
version was just an X.500 DirectoryString with a number of constraints applied
to it, but just before publication this was changed to:


DirectoryString ::= CHOICE {
   printableString         PrintableString (SIZE(1..maxSIZE)),
   bmpString               BMPString (SIZE(1..maxSIZE))
   }
                   You must unlearn what you have learned.
                       -- Yoda

It was felt that this improved readability and interoperability (and sanity).
T61String was never seriously considered in the design, and UniversalString
with its four-byte characters had no identifiable industry support and required
too much overhead. If you want to produce certs which work for both generic
X.509 and SET, then using the SET version of the DirectoryString is a good
idea. It's trivial to convert an ISO 8859-1 T61String to a BMPString and back
(just add/subtract a 0 byte every other byte).
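
A minimal Python sketch of that conversion, assuming the T61String really does hold plain ISO 8859-1 bytes with no escape sequences. The function names are hypothetical.

```python
def latin1_to_bmp(data: bytes) -> bytes:
    """Turn ISO 8859-1 bytes into BMPString content octets.

    Each input byte gains a leading zero byte (big-endian 16-bit).
    """
    return bytes(b for ch in data for b in (0, ch))


def bmp_to_latin1(data: bytes) -> bytes:
    """Turn BMPString content octets back into ISO 8859-1 bytes.

    Fails if any character lies outside the 8859-1 range.
    """
    if len(data) % 2:
        raise ValueError("BMPString content must have an even length")
    if any(data[0::2]):  # a non-zero high byte means code point > 0xFF
        raise ValueError("character outside ISO 8859-1")
    return data[1::2]
```

For the name in the question below, latin1_to_bmp(b"Ant\xf3nio") gives 00 41 00 6E 00 74 00 F3 00 6E 00 69 00 6F, and bmp_to_latin1 reverses it.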


MISSI also subsets the string types, allowing only PrintableString and
T61String in DNs.

When dealing with these character sets you should use the "least inclusive" set
when trying to determine which encoding to use. This means trying to encode as
PrintableString first, then T61String, and finally BMPString/UniversalString.
SET requires that either PrintableStrings or BMPStrings be used, with
TeletexStrings and UniversalStrings being forbidden.


From this we can build the following set of recommendations:

- Use PrintableString if possible (or VisibleString or IA5String if this is
allowed, because it's rather more useful than PrintableString).
- If you use a T61String (and assuming you don't require SET compliance), avoid
the use of anything involving shifting and escape codes at any cost and just
treat it as a pure ISO 8859-1 string. If you need anything other than
8859-1, use a BMPString.
- If it won't go into one of the above, try for a BMPString.
- Avoid UniversalStrings.
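
The recommendations above can be sketched as a "least inclusive set first" selection routine. This is only an illustration under stated assumptions: the function name choose_string_type and the allow_t61 flag (set it to False for SET compliance) are invented here, T61String is treated as pure ISO 8859-1, and UniversalString is never chosen.

```python
import string

# PrintableString alphabet: letters, digits, space and ' ( ) + , - . / : = ?
_PRINTABLE = frozenset(string.ascii_letters + string.digits + " '()+,-./:=?")


def choose_string_type(text: str, allow_t61: bool = True):
    """Return (type name, content octets) using the least inclusive type.

    Hypothetical helper following the order PrintableString, T61String
    (as plain ISO 8859-1), then BMPString.
    """
    if all(ch in _PRINTABLE for ch in text):
        return "PrintableString", text.encode("ascii")
    if allow_t61:
        try:
            # No shifting or escape codes: straight 8859-1 only.
            return "T61String", text.encode("latin-1")
        except UnicodeEncodeError:
            pass  # needs something beyond 8859-1, fall through
    if all(ord(ch) <= 0xFFFF and not 0xD800 <= ord(ch) <= 0xDFFF for ch in text):
        return "BMPString", text.encode("utf-16-be")
    raise ValueError("string needs UniversalString, which is avoided here")
```

With these rules "Example CA" stays a PrintableString, "António" becomes a T61String (or a BMPString when allow_t61 is False), and Greek text goes straight to BMPString.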


Version 7 of the PKIX draft dropped the use of T61String altogether (probably
in response to this writeup :-), but this may be a bit extreme since the
extremely limited character range allowed by PrintableString will result in
many simple strings blowing out to BMPStrings, which causes problems on a
number of systems which have little Unicode support.


In 2004, you can switch to UTF-8 strings and forget about this entire section
of the guide."


Nuno Miguel Neves wrote:

I've issued a certificate whose name had an accent (António).

However, both in Mozilla and Thunderbird, the certificate shows up with an empty name. :(

Is there any way I can fix this, or must I enforce that all names are written without accents?

Thanks,


