Re: Bug reported regarding Unicode handling in email address

Steffen Nurpmeso Sun, 13 Jun 2021 15:37:24 -0700

Ken Hornstein wrote in
 <[email protected]>:
 |>Last i looked they use a gigantic chunk of memory in mbstate_t or
 |>so (128 byte?).
 |
 |128 bytes is considered 'gigantic'? :-)


Per mbstate, yes!  That is nearly a quote of Christos Zoulas of
NetBSD, in fact.  Here on my GNU/Linux

  #?0|kent:steffen$ pkginfo -o /usr/include/bits/types/__mbstate_t.h
  Package  File
  glibc    usr/include/bits/types/__mbstate_t.h
  #?0|kent:steffen$ prt-get info glibc
  ...
  Version:      2.32
  Release:      3
  Description:  The C library used in the GNU system
  URL:          http://www.gnu.org/software/libc/
  ...

it is

  typedef struct
  {
    int __count;
    union
    {
      __WINT_TYPE__ __wch;
      char __wchb[4];
    } __value;            /* Value so far.  */
  } __mbstate_t;

A "git show origin/trunk:sys/sys/ansi.h" for NetBSD shows

  /*
   * mbstate_t is an opaque object to keep conversion state, during multibyte
   * stream conversions.  The content must not be referenced by user programs.
   */
  typedef union {
          __int64_t __mbstateL; /* for alignment */
          char __mbstate8[128];
  } __mbstate_t;

That is quite a difference.  You do not need an mbstate that often,
but that it looks like this says something by itself.  That was
really close to quoting Zoulas' words now.
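
To check the figure on a given system, a minimal sketch like the
following can be compiled there.  The function names are made up
for illustration.  With the glibc definition quoted above (an int
plus a 4-byte union) it would typically report 8, and with the
NetBSD union it should report 128; i have not run it on either.

```c
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: size in bytes of the platform's mbstate_t. */
static size_t mbstate_size(void)
{
    return sizeof(mbstate_t);
}

/* Hypothetical helper: print the size for a quick comparison. */
static void print_mbstate_size(void)
{
    printf("sizeof(mbstate_t) = %zu\n", mbstate_size());
}
```

Calling print_mbstate_size() from main() is enough to compare two
systems.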

 |While I am not a huge fan of the POSIX locale functions, thankfully we can
 |mostly get by without them.  Basically we use iconv() to convert from the
 |source character set to the native character set, and we have a small
 |amount of mbtowc() and wcwidth() to handle multibyte character sets and
 |figure out column width (and really, we only do UTF-8 well).

That almost matches my mailx clone.  We need it for the HTML
filter and the built-in line editor in addition.  Truly stateful
encodings it would not handle well, i think, but i have forgotten
the details.  Right-to-left script support is also unacceptable,
i'd say.
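
The column width part can be sketched roughly as below.  This is
an assumption about the general technique, not code from nmh or
from the mailx clone: display_width() is a made-up name, and it
uses the restartable mbrtowc() instead of the mbtowc() mentioned
above.  The caller is expected to have called
setlocale(LC_CTYPE, "") beforehand.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* wcwidth() is XSI, not ISO C */
#endif
#include <string.h>
#include <wchar.h>

/* Declared explicitly in case the feature macros were already
 * fixed by an earlier include; matches the POSIX prototype. */
int wcwidth(wchar_t wc);

/* Hypothetical helper: number of terminal columns the multibyte
 * string s occupies in the current LC_CTYPE locale, or -1 if it
 * contains an invalid sequence or a non-printable character. */
static int display_width(const char *s)
{
    mbstate_t st;
    size_t left = strlen(s);
    int cols = 0;

    memset(&st, 0, sizeof st);      /* initial conversion state */
    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &st);
        int w;

        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;              /* invalid or truncated sequence */
        if (n == 0)
            break;                  /* NUL reached */
        w = wcwidth(wc);
        if (w < 0)
            return -1;              /* no defined column width */
        cols += w;
        s += n;
        left -= n;
    }
    return cols;
}
```

For plain ASCII input this simply counts the bytes; the interesting
cases (wide CJK characters, combining marks) depend entirely on the
locale tables of the C library.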

For me i can say: of course that is bad; a "modern" program would
avoid back-and-forth conversion and convert only once for input
and once for output.  Then you can argue about which format you
live in: Windows and MirBSD use UTF-16, most Unix and perl use
UTF-8, i was all for UTF-32 at times, but i was wrong with that.
Heck, that stuff is very complicated (not to speak of calendars of
all sorts etc.); i thought i would finally get used to it doing
OSS, after all that only-German only-European-Time
only-ISO-8859-1(5) before.  But it turns out you crawl in circles
around all that when working with C-style strings.  In fact i now
have bug reports that newer versions do not even work with 8-bit
European locales, after i switched to fallbacks for isspace() etc.
that check isascii() first (to avoid un/signed char extension
problems) on systems that do not provide all of
HAVE_SETLOCALE && HAVE_C90AMEND1 && HAVE_WCWIDTH.
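
A minimal sketch of such an ASCII-first fallback follows.  The
names are invented and this is not the actual code; the range
check stands in for isascii().  It also shows why 8-bit locales
presumably suffer: a byte like 0xE4 ('ä' in ISO-8859-1) can never
be classified as a letter this way.

```c
#include <ctype.h>

/* Hypothetical helper: 1 if the byte is in the 7-bit ASCII range.
 * Going through unsigned char avoids sign extension of a plain
 * char that happens to be signed. */
static int fb_isascii(char c)
{
    return (unsigned char)c < 0x80;
}

/* Hypothetical fallbacks: only ever consult the <ctype.h>
 * functions for ASCII bytes, everything else is "no". */
static int fb_isspace(char c)
{
    return fb_isascii(c) && isspace((unsigned char)c) != 0;
}

static int fb_isalpha(char c)
{
    return fb_isascii(c) && isalpha((unsigned char)c) != 0;
}
```

The price of being safe against negative char values is that all
bytes from 0x80 upwards lose their locale-specific classification.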

It is really terrible that all this is a black box; but how to do
it right?  In the end, with Unicode aka UTF-8 locales, we come to
a point where it would really be doable, since now
nl_langinfo(CODESET), or whatever input character set you pass to
iconv(3), could actually be mapped to the entire set of character
classification functions, like [w]isspace() etc.  But as it is you
have black box iconv here, and LC-derived classification
possibilities there.  You are condemned to live in the user's locale.
It seems serious programs have to move on and use the ICU library,
that opens that door that always was superficial (but of course,
"reverse engineering" from a character set to classification
functions, it requires a science to know whether it works out).
The locale functions are just too restricted, and wcwidth(3) is
not even ISO.  Though a true port to non-POSIX (non-ASCII) is not
on the table anyway, for the mailx clone i maintain.
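
For what it is worth, asking for the user's codeset looks about
like the sketch below.  locale_is_utf8() is a made-up name, and
the list of spellings is an assumption, since codeset names are
not standardized ("UTF-8" on glibc, but other spellings exist).
It reports on whatever locale is currently active, so a program
would call setlocale(LC_CTYPE, "") first.

```c
#include <langinfo.h>
#include <string.h>

/* Hypothetical helper: 1 if the active LC_CTYPE codeset is
 * reported as UTF-8 under one of a few known spellings. */
static int locale_is_utf8(void)
{
    static const char *const names[] = { "UTF-8", "utf-8", "UTF8", "utf8" };
    const char *cs = nl_langinfo(CODESET);
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; ++i)
        if (strcmp(cs, names[i]) == 0)
            return 1;
    return 0;
}
```

That tells you the name of the character set, and nothing more:
the classification functions still follow the locale, not the
character set you hand to iconv(3).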

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
