On Mon, Sep 16, 2013 at 7:59 PM, Roland Mainz <[email protected]> wrote:
> On Mon, Sep 16, 2013 at 7:26 PM, Wendy Lin <[email protected]> wrote:
>> On 16 September 2013 18:39, Roland Mainz <[email protected]> wrote:
>>> While doing some GB18030 testing I found a disturbing issue:
>>> A lot of calls to the |*mb*()| functions are made without regard
>>> for the current shift state. That state is a hidden global
>>> variable and is easily overlooked (the fact that UTF-8 can recover
>>> from invalid shift states makes this worse, since UTF-8 locales
>>> won't suffer from this problem) ... which causes problems for
>>> shift-state-dependent encodings like GBK/GB18030/ShiftJis.
>>>
>>> My preferred solution would be to change the current libast mb API to
>>> always take a |mbstate_t| argument. This would fix this issue (by
>>> making the shift state explicit), fix issues with nested calls, e.g.
>>> if we are in a specific shift state and then call a utility function
>>> which operates on a different string ... and fix thread-safety
>>> issues with the "hidden" global variable containing the current shift
>>> state...
>>
>> Well, this may explain why ksh93 sometimes has lapses when it has to
>> process characters which are not encoded in UTF-8. bash 4 handles
>> this flawlessly, but only because it uses mbstate_t ps; memset (&ps, 0,
>> sizeof (mbstate_t)); wcrtomb() everywhere. Even a single call to an mb
>> function without mbstate_t can render your whole application useless.
>
> The problem is *NOT* the choice of function... the problem is that we
> use a global (or semi-global) state variable.
> Technically each single utility function (this includes all
> multibyte-aware functions in libshell and libcmd, too) should have its
> own |mbstate_t|.
> A major issue we found is that a multibyte character string is being
> processed... and in the middle of that processing we call something
> else which operates on a different multibyte character stream. As a
> result of this nesting, or even of calling other ((buggy !)
> system-)library functions, the |mbstate_t| state used by the caller
> gets screwed up. And that's causing trouble all over the place.
> There are two reasons this doesn't cause much trouble yet:
> 1. Most people use UTF-8-based locales, which have recovery built into
> the encoding itself
> 2. Many system i18n multibyte handling functions automagically recover
> from invalid states without returning an error. But not all can do it
> (e.g. because the encoding isn't designed in such a way) or will do it
> (to keep the code simple+easy+fast and force correct programming).
>
> For example some GBK/GB18030 implementations can do it (like on
> Solaris using IBM's OpenGroup i18n/multibyte implementation) but not
> Illumos/OpenSolaris&&FreeBSD which use a different i18n/multibyte
> implementation. As a result some stuff works on Solaris but causes
> endless loops or data corruption on Illumos/OpenSolaris/FreeBSD/etc.
> ... ;-(
>
>> Q: Why doesn't POSIX deprecate mb functions which do not use a
>> mbstate_t? The mistake ksh93 makes is easy to make and so hard to
>> rectify.
>
> Erm... for simple utilities the global state _sounds_ like an easy
> choice... but given the trouble you can end up in by simply ignoring
> the fact that multibyte encodings can have a state, I wish those
> functions had never been invented... ;-/

Glenn: Are the following functions *always* available when multibyte
support is enabled, on all platforms you can test?
-- snip --
mbrlen()
mbrtowc()
wcrtomb()
wcsrtombs()
mbsrtowcs()
-- snip --

If that's true then most of the patch is just a simple switch-over:
add states and maybe add some new functions which accept a state
object (for cases where we start in the middle of a string and have to
restart over and over again).
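
To illustrate what such a switch-over could look like, here is a
minimal sketch using only standard C |mbrtowc()| with a
caller-private |mbstate_t|. The function name |count_mb_chars()| is
made up for this example and is not part of libast; the real patch
would likely look different.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper (NOT a libast function): count the multibyte
 * characters in the first n bytes of s, stopping early at a NUL.
 * The conversion state lives on this function's stack, so nested or
 * concurrent callers cannot disturb it.
 * Returns (size_t)-1 on an invalid or truncated sequence.
 */
size_t count_mb_chars(const char *s, size_t n)
{
    mbstate_t st;
    size_t count = 0;

    assert(s != NULL);
    memset(&st, 0, sizeof st);      /* start in the initial shift state */

    while (n > 0) {
        /* Restartable call: the state is passed explicitly. */
        size_t len = mbrtowc(NULL, s, n, &st);

        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;      /* invalid or incomplete sequence */
        if (len == 0)
            break;                  /* reached a NUL character */
        s += len;
        n -= len;
        count++;
    }
    return count;
}
```

Because |st| is local, a caller which is itself in the middle of
walking another multibyte string (with its own |mbstate_t|) can call
this helper without having its shift state clobbered, which is the
nesting case described above.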

----

Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) [email protected]
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 3992797
 (;O/ \/ \O;)