On Mon, Sep 16, 2013 at 7:26 PM, Wendy Lin <[email protected]> wrote:
> On 16 September 2013 18:39, Roland Mainz <[email protected]> wrote:
>> While doing some GB18030 testing I found a disturbing issue:
>> A lot of calls to the |*mb*()| functions are done without thinking
>> about the current shift state. The issue is that this state is a
>> hidden global variable and may easily be overlooked (the issue that
>> UTF-8 can recover from invalid shift states makes this worse since
>> UTF-8 locales won't suffer from this problem) ... which causes
>> problems for Shift-State depending encodings like
>> GBK/GB18030/ShiftJis.
>>
>> My preferred solution would be to change the current libast mb API to
>> always take a |mbstate_t| argument. This would fix this issue (by
>> making the shift state explicit), fix issues with nesting calls, e.g.
>> if we are in a specific shift state and then call a utility function
>> which operates on a different string ... and fix thread-safeness
>> issues with the "hidden" global variable containing the current shift
>> state...
>
> Well, this may explain why ksh93 sometimes has lapses when it wants to
> process characters which are encoded not with UTF8. bash 4 handles
> this flawlessly, but only since they use mbstate_t ps;memset (&ps, 0,
> sizeof (mbstate_t));wcrtomb() everywhere. Even using a single mb
> function without mbstate_t can render your whole application useless.

The problem is *NOT* the choice of function... the problem is that we
use a global (or semi-global) state variable. Technically each single
utility function (this includes all multibyte-aware functions in
libshell and libcmd, too) should have its own |mbstate_t|.

A major issue we found is that a multibyte character string is being
processed... and in the middle of that processing we call something
else which operates on a different multibyte character stream. As a
result of this nesting, or even of calling other ((buggy!)
system-)library functions, the |mbstate_t| state used by the caller
gets screwed up. And that's causing trouble all over the place.

There are two reasons this doesn't cause much trouble yet:
1. Most people use UTF-8-based locales, which have recovery built into
   the encoding itself
2. Many system i18n multibyte handling functions automagically recover
   from invalid states without returning an error. But not all can do
   it (e.g. because the encoding isn't designed in such a way) or will
   do it (to keep the code simple+easy+fast and force correct
   programming).

For example some GBK/GB18030 implementations can do it (like on
Solaris using IBM's OpenGroup i18n/multibyte implementation) but not
Illumos/OpenSolaris&&FreeBSD, which use a different i18n/multibyte
implementation. As a result some stuff works on Solaris but causes
endless loops or data corruption on
Illumos/OpenSolaris/FreeBSD/etc. ... ;-(

> Q: Why doesn't POSIX deprecate mb functions which do not use a
> mbstate_t? The mistake ksh93 does is easy to make and so hard to
> rectify.

Erm... for simple utilities the global state _sounds_ like an easy
choice... but given the trouble you can end up in by simply ignoring
the fact that multibyte encodings can have a state, I wish the
functions had never been invented... ;-/

----

Bye,
Roland
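
P.S. To make the "each function has its own |mbstate_t|" point a bit
more concrete, here is a minimal sketch in plain ISO C using
|mbrtowc()|. The names |mb_count_chars()| and |mb_nested_demo()| are
made up for illustration and are not part of libast, libshell or
libcmd. It only shows the pattern: the shift state is a local variable,
so a nested walk over a different string cannot disturb the caller's
state.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Count the characters in the multibyte string s (n bytes long).
 * The conversion state lives in a local mbstate_t, so nothing a
 * callee or another thread does can disturb it.
 * Returns (size_t)-1 on an invalid or truncated sequence.
 * (Hypothetical helper for illustration, not a libast function.)
 */
size_t mb_count_chars(const char *s, size_t n)
{
    mbstate_t st;
    size_t count = 0;

    memset(&st, 0, sizeof(st));      /* start in the initial shift state */
    while (n > 0) {
        size_t len = mbrtowc(NULL, s, n, &st);

        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;       /* invalid or incomplete character */
        if (len == 0)
            break;                   /* embedded NUL ends the string */
        s += len;
        n -= len;
        count++;
    }
    return count;
}

/*
 * Walk `outer` character by character and, in the middle of that walk,
 * process the unrelated string `inner`. Each walk owns its state, so
 * the nesting is harmless. Returns the sum of the inner counts, or
 * (size_t)-1 on error. (Also hypothetical.)
 */
size_t mb_nested_demo(const char *outer, size_t olen,
                      const char *inner, size_t ilen)
{
    mbstate_t st;
    size_t total = 0;

    memset(&st, 0, sizeof(st));
    while (olen > 0) {
        size_t len = mbrtowc(NULL, outer, olen, &st);
        size_t k;

        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;
        if (len == 0)
            break;
        k = mb_count_chars(inner, ilen);  /* nested walk, cannot touch st */
        if (k == (size_t)-1)
            return (size_t)-1;
        total += k;
        outer += len;
        olen -= len;
    }
    return total;
}
```

A caller would first select the locale with |setlocale(LC_CTYPE, "")|
and then call e.g. |mb_count_chars(buf, strlen(buf))|. With the
non-restartable |mblen()|/|mbtowc()| the same nesting would share one
hidden state between both walks, which is exactly the situation
described above.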
[email protected]
MPEG specialist, C&&JAVA&&Sun&&Unix programmer
TEL +49 641 3992797
