On Mon, Sep 16, 2013 at 7:59 PM, Roland Mainz <[email protected]> wrote:
> On Mon, Sep 16, 2013 at 7:26 PM, Wendy Lin <[email protected]> wrote:
>> On 16 September 2013 18:39, Roland Mainz <[email protected]> wrote:
>>> While doing some GB18030 testing I found a disturbing issue:
>>> A lot of calls to the |*mb*()| functions are done without thinking
>>> about the current shift state. The issue is that this state is a
>>> hidden global variable and may easily be overlooked (the issue that
>>> UTF-8 can recover from invalid shift states makes this worse since
>>> UTF-8 locales won't suffer from this problem) ... which causes
>>> problems for Shift-State depending encodings like
>>> GBK/GB18030/ShiftJis.
>>>
>>> My preferred solution would be to change the current libast mb API to
>>> always take a |mbstate_t| argument. This would fix this issue (by
>>> making the shift state explicit), fix issues with nesting calls, e.g.
>>> if we are in a specific shift state and then call a utility function
>>> which operates on a different string ... and fix thread-safeness
>>> issues with the "hidden" global variable containing the current shift
>>> state...
>>
>> Well, this may explain why ksh93 sometimes has lapses when it wants to
>> process characters which are encoded not with UTF8. bash 4 handles
>> this flawlessly, but only since they use mbstate_t ps;memset (&ps, 0,
>> sizeof (mbstate_t));wcrtomb() everywhere. Even using a single mb
>> function without mbstate_t can render your whole application useless.
>
> The problem is *NOT* the choice of function... the problem is that we
> use a global (or semi-global) state variable.
> Technically each single utility function (this includes all
> multibyte-aware functions in libshell and libcmd, too) should have its
> own |mbstate_t|.
> A major issue we found is that a multibyte character string is being
> processed... and in the middle of that processing we call something
> else which operates on a different multibyte character stream. As a
> result of this nesting or even calling other ((buggy!)
> system-)library functions the |mbstate_t| state used by the caller
> gets screwed up. And that's causing trouble all over the place.
> There are two reasons this doesn't cause much trouble yet:
> 1. Most people use UTF-8-based locales, which have recovery built into
> the encoding itself
> 2. Many system i18n multibyte handling functions automagically recover
> from invalid states without returning an error. But not all can do it
> (e.g. because the encoding isn't designed in such a way) or will do it
> (to keep the code simple+easy+fast and force correct programming).
>
> For example some GBK/GB18030 implementations can do it (like on
> Solaris using IBM's OpenGroup i18n/multibyte implementation) but not
> Illumos/OpenSolaris&&FreeBSD which use a different i18n/multibyte
> implementation. As a result some stuff works on Solaris but causes
> endless loops or data corruption on Illumos/OpenSolaris/FreeBSD/etc.
> ... ;-(
>
>> Q: Why doesn't POSIX deprecate mb functions which do not use a
>> mbstate_t? The mistake ksh93 does is easy to make and so hard to
>> rectify.
>
> Erm... for simple utilities the global state _sounds_ like an easy
> choice... but given the trouble you can end up in by simply ignoring the
> issue that multibyte encodings can have a state I wish the functions
> would've never been invented... ;-/

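
To make the nesting case above concrete, here is a minimal sketch in
plain ISO C. It is not libast code: |count_chars()| and |walk_nested()|
are made-up names for illustration only. The idea is that each function
owns its |mbstate_t|, so the nested call on a different string cannot
disturb the state of the caller:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not a libast function): count the characters in
 * the first n bytes of s, using a conversion state private to this call.
 * Returns (size_t)-1 on an invalid or truncated sequence. */
static size_t count_chars(const char *s, size_t n)
{
    mbstate_t st;
    size_t count = 0;

    assert(s != NULL);
    memset(&st, 0, sizeof st);          /* initial shift state */
    while (n > 0) {
        size_t len = mbrtowc(NULL, s, n, &st);
        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;          /* invalid or incomplete input */
        if (len == 0)
            len = 1;                    /* assumes NUL is a single byte */
        s += len;
        n -= len;
        count++;
    }
    return count;
}

/* Hypothetical caller: walk `outer` one character at a time and, in the
 * middle of that, run the helper on the unrelated string `inner`.
 * The outer state `st` is local, so the nested call cannot touch it. */
static size_t walk_nested(const char *outer, const char *inner,
                          size_t *inner_total)
{
    mbstate_t st;
    size_t n, count = 0;

    assert(outer != NULL && inner != NULL && inner_total != NULL);
    n = strlen(outer);
    memset(&st, 0, sizeof st);
    *inner_total = 0;
    while (n > 0) {
        size_t len, r;

        len = mbrtowc(NULL, outer, n, &st);
        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;
        if (len == 0)
            break;                      /* not reached: n comes from strlen() */

        r = count_chars(inner, strlen(inner));   /* nested, own state */
        if (r == (size_t)-1)
            return (size_t)-1;
        *inner_total += r;

        outer += len;
        n -= len;
        count++;
    }
    return count;
}
```

With |mbtowc()|/|mblen()| and their hidden state the same nesting would
let the inner call change the state the outer loop depends on, which is
where a stateful encoding can go wrong. The sketch only uses
|mbrtowc()| from <wchar.h>; the caller is expected to have selected the
locale via |setlocale()| beforehand.
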
Glenn: Are the following functions *always* available when multibyte
support is enabled for all platforms you can test:
-- snip --
mbrlen()
mbrtowc()
wcrtomb()
wcsrtombs()
mbsrtowcs()
-- snip --

If that's true then most of the patch is just a simple switch-over,
add states and maybe add some new functions which accept a state
object (for cases where we start in the middle of a string and have to
restart over and over again)

----

Bye,
Roland

--
  __ .  . __
 (o.\ \/ /.o) [email protected]
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 3992797
 (;O/ \/ \O;)
_______________________________________________
ast-developers mailing list
[email protected]
http://lists.research.att.com/mailman/listinfo/ast-developers
