Ken Hornstein wrote in <[email protected]>:
 |>Last i looked they use a gigantic chunk of memory in mbstate_t or
 |>so (128 byte?).
 |
 |128 bytes is considered 'gigantic'? :-)

For an mbstate_t, yes!  That is nearly a quote of Christos Zoulas of
NetBSD, in fact.  Here on my GNU/Linux

  #?0|kent:steffen$ pkginfo -o /usr/include/bits/types/__mbstate_t.h
  Package  File
  glibc    usr/include/bits/types/__mbstate_t.h
  #?0|kent:steffen$ prt-get info glibc
  ...
  Version:      2.32
  Release:      3
  Description:  The C library used in the GNU system
  URL:          http://www.gnu.org/software/libc/
  ...

it is

  typedef struct
  {
    int __count;
    union
    {
      __WINT_TYPE__ __wch;
      char __wchb[4];
    } __value;            /* Value so far.  */
  } __mbstate_t;

A "git show origin/trunk:sys/sys/ansi.h" for NetBSD shows

  /*
   * mbstate_t is an opaque object to keep conversion state, during multibyte
   * stream conversions.  The content must not be referenced by user programs.
   */
  typedef union {
          __int64_t __mbstateL;   /* for alignment */
          char __mbstate8[128];
  } __mbstate_t;

That is quite a difference: 128 bytes there, against an int plus a
four-byte union (8 bytes on typical platforms) here.  You do not need an
mbstate_t that often, but that it is this large says something in
itself.  That, too, was close to a quote of Zoulas' words.

 |While I am not a huge fan of the POSIX locale functions, thankfully we can
 |mostly get by without them.  Basically we use iconv() to convert from the
 |source character set to the native character set, and we have a small
 |amount of mbtowc() and wcwidth() to handle multibyte character sets and
 |figure out column width (and really, we only do UTF-8 well).

That almost matches my mailx clone.  We need it for the HTML filter and
the built-in line editor in addition.  Truly stateful encodings it would
not handle well, i think, but i have forgotten the details.  The
handling of right-to-left scripts is also unacceptable, i'd say.  For my
part i can say: of course that is bad; a "modern" program would avoid
converting back and forth, and convert only once on input and once on
output.  Then you can argue about which format to live in internally:
Windows and MirBSD use UTF-16, most Unix and perl use UTF-8; i was all
for UTF-32 at times, but i was wrong about that.
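
To make the mbtowc()/wcwidth() column counting quoted above a bit more
concrete, here is a minimal sketch; it is not taken from either mailer.
It uses the restartable mbrtowc() with an explicit mbstate_t instead of
plain mbtowc(), the helper name display_width is made up for
illustration, and it assumes the caller has already selected the user's
locale with setlocale(LC_CTYPE, "").

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* wcwidth(3) is XSI, not ISO C */
#endif
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper (name is an assumption, not from any mailer):
 * return the number of display columns of the multibyte string s in the
 * current LC_CTYPE locale, or -1 on an invalid or truncated sequence.
 */
static int display_width(const char *s)
{
    mbstate_t st;
    size_t left = strlen(s);
    int cols = 0;

    memset(&st, 0, sizeof st);      /* start in the initial shift state */
    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &st);
        int w;

        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;              /* invalid or incomplete sequence */
        if (n == 0)
            break;                  /* NUL; not reached, left excludes it */
        w = wcwidth(wc);
        if (w > 0)                  /* non-printable characters add nothing */
            cols += w;
        s += n;
        left -= n;
    }
    return cols;
}
```

A caller would run setlocale(LC_CTYPE, "") once at startup and then use
display_width(line) wherever columns matter.  Printing sizeof(mbstate_t)
on the two systems should make the size difference shown above visible.
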

Heck, that stuff is very complicated (not to mention calendars of all
sorts etc.); i thought i would finally get used to it doing OSS, after
all that only-German only-European-Time only-ISO-8859-1(5) before.  But
it turns out you crawl in circles around all that when working with
C-style strings.  In fact i now have bug reports that newer versions do
not even work with 8-bit European locales, after i turned to using
fallbacks for isspace() etc. that check isascii() first (to avoid
un/signed char extension problems) on systems that do not provide all of
HAVE_SETLOCALE && HAVE_C90AMEND1 && HAVE_WCWIDTH.

It is really terrible that all this is a black box; but how to do it
right?  In the end, with Unicode aka UTF-8 locales, we come to a point
where it would really be doable, since now nl_langinfo(CODESET), or
whatever input character set you pass to iconv(3), could actually be
mapped onto the entire set of character classification functions, like
[w]isspace() etc.  But as it is you have a black-box iconv here, and
LC-derived classification possibilities there.  You are condemned to
live in the user's locale.  It seems serious programs have to move on
and use the ICU library, which opens that door that always was
superficial (but of course, "reverse engineering" from a character set
to classification functions requires a science to know whether it works
out).  The locale functions are just too restricted, and wcwidth(3) is
not even ISO C.  Though a true port to non-POSIX (non-ASCII) is not on
the table anyway for the mailx clone i maintain.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
