Re: Bug reported regarding Unicode handling in email address

Steffen Nurpmeso Mon, 14 Jun 2021 09:26:39 -0700

Hello Ralph.

Ralph Corderoy wrote in
 <[email protected]>:
 |> It is really terrible that all this is a black box; but how to do it
 |> right?  In the end, with Unicode aka UTF-8 locales, we come to a point
 |> where it would really be doable, since now nl_langinfo(CODESET), or
 |> whatever input character set you pass to iconv(3), could actually be
 |> warped to the entire set of character classification functions, like
 |> [w]isspace() etc.  But so you have black box iconv here, and LC
 |> derived classification possibilities there.  You are condemned to live
 |> in the user's locale.
 |
 |Why not iconv(3) the input from the user's locale, the MIME part's
 |charset, etc., to UTF-8, work internally, and then iconv() again on the
 |way back out?  I feel you're telling me above, but I don't quite get
 |your point.


Sure, convert to Unicode, work in Unicode, convert back: that is
the way to go.  It is still hard to do with POSIX, let alone ISO C.
You need a UTF-8 locale you can actively select, the POSIX/ISO
functions do not support graphemes, and __STDC_ISO_10646__ is
optional, so you cannot simply code some tables of your own to
fill the gaps, because looking at a wchar_t value may not give
you a Unicode code point (though maybe all implementations do it
like that, so in practice you could make this a precondition).
I had to look, but I think having the "WCHAR_T" iconv(3) target is
absolutely non-portable as well.  So portable code has to scrap the
portable ISO/POSIX functions and roll its own stuff, no?  :)
And finally, not all character sets truly support round-tripping
from/to Unicode, but in reality this should not hurt either.
Really, the older I get the more I think that UTF-16 is not the
worst decision regarding Unicode.  Surrogate pairs have to be
handled, but with UTF-8 you always have to live with multibyte
anyway.
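
For illustration, a minimal sketch of the "convert in, work in UTF-8,
convert back out" path using only iconv(3).  The helper name
charset_convert, its return convention and the fixed-size output buffer
are assumptions made up for this example, not code from any mailer; it
also assumes the iconv implementation knows the charset names passed in
(names like "ISO-8859-1" and "UTF-8" are widely accepted, but the set is
implementation-defined).

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated string `in` from
 * charset `from` to charset `to`, storing a NUL-terminated result in
 * `out` (capacity `outsize` bytes).
 * Returns the number of bytes written, not counting the NUL,
 * or -1 on any error (unknown charset, invalid input, buffer too small). */
long charset_convert(const char *from, const char *to,
                     const char *in, char *out, size_t outsize)
{
    iconv_t cd;
    char *inp = (char *)in;   /* iconv(3) wants char **; input is not modified */
    char *outp = out;
    size_t inleft, outleft;
    int ok;

    assert(from != NULL && to != NULL && in != NULL && out != NULL);
    if (outsize == 0)
        return -1;
    inleft = strlen(in);
    outleft = outsize - 1;    /* keep one byte for the terminating NUL */

    cd = iconv_open(to, from);   /* note the argument order: target first */
    if (cd == (iconv_t)-1)
        return -1;

    /* Convert, then flush any pending shift sequence of stateful encodings. */
    ok = iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1
      && iconv(cd, NULL, NULL, &outp, &outleft) != (size_t)-1;
    iconv_close(cd);
    if (!ok)
        return -1;

    *outp = '\0';
    return (long)(outp - out);
}
```

Calling charset_convert("ISO-8859-1", "UTF-8", ...) on the way in and
the reverse on the way out gives the round trip; whether that round trip
is lossless depends on the character set, as said above.  Nothing here
touches wchar_t or the locale, which is the point: the classification
side (isspace and friends, graphemes) still has to come from somewhere
else.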

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
