On Tue, Apr 16, 2013 at 11:03:41PM +0200, Roland Mainz wrote:
> >
> > More data: if I force the ksh93 builtin "iconv" to read from a pipe I
> > get a warning about an incomplete multibyte sequence...
> > -- snip --
> > $ LC_ALL=en_US.UTF-8 ../build_i386_32bit_debug/arch/linux.i386/bin/ksh
> > -c 'builtin iconv ; cat
> > /tmp/astksh20130409_suse123_32bit_builtin_iconv_hang1.txt | iconv -f
> > UTF-8 >/tmp/zzz2 ; true'
> > iconv: incomplete multibyte sequence at offset 32767 [Invalid argument]
> > -- snip --
> > ... it seems the issue is somehow related to the difference that
> > "iconv" reading a plain file uses |mmap()| ... triggering a different
> > codepath than reading from a pipe.
> >
> > Question is now... who is correct ? GNU "iconv" doesn't seem to print
> > any warnings/errors for the input file while AST "iconv" prints a
> > warning when reading from a pipe and hangs when reading via |mmap()|
> > ...
> > ... another issue is... why does this only happen for 32bit builds ?
> 
> The issue does happen for 64bit builds, too.
> 
> It seems it happens (for 32bit builds) when a multibyte character is
> exactly at a 32k buffer boundary... one part of the multibyte
> character is in the first buffer and the rest of the multibyte
> character's bytes is in the 2nd buffer.
> 
> Here is a reduced/standalone testcase:
> -- snip --
> $ ksh -c 'builtin iconv ; integer i ; typeset prefix="123" ; for ((i=0
> ; i < 2**16 ; i++ )) ; do printf "%s\u[20ac]" "$prefix" ; done | iconv
> -f UTF-8 >xxx'
> -- snip --
> (the string length of "prefix" may have to be varied to catch sfio
> buffers of a different size (I'll write a testcase for the builtin
> iconv later))

Sounds similar to the problem I found with Shift-JIS in comsubst(),
located in src/cmd/ksh93/sh/macro.c. I currently have a patch that
works around it for Shift-JIS multibyte characters only, as those can
contain ASCII characters as well, which causes trouble if such a
multibyte character sits on the buffer boundary. If this now turns out
to be a general problem, even for UTF-8 multibyte characters, a more
general solution in the ksh buffer handling may be required to avoid
such problems.

The parser/scanner routines used after e.g. comsubst() on such buffers
could wait until all bytes of a multibyte character have been copied.
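
As a rough illustration of that idea (this is not the actual ksh93 or
sfio code): a hypothetical helper, utf8_incomplete_tail(), could report
how many bytes at the end of a buffer belong to a UTF-8 character whose
remaining bytes have not arrived yet, so the caller can hold them back
until the next buffer is read. Both function names are made up for this
sketch, and it covers UTF-8 only; other encodings would need their own
lead-byte rules.

```c
#include <assert.h>
#include <stddef.h>

/* Total length of a UTF-8 sequence as announced by its lead byte,
 * or 0 if the byte is a continuation byte or not a valid lead byte. */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;
    if ((lead & 0xE0) == 0xC0)
        return 2;
    if ((lead & 0xF0) == 0xE0)
        return 3;
    if ((lead & 0xF8) == 0xF0)
        return 4;
    return 0;
}

/* Hypothetical helper: number of bytes at the end of buf that start a
 * UTF-8 character which is still incomplete. Returns 0 if the buffer
 * ends on a complete character, or on data that is not valid UTF-8
 * (reporting that is left to the converter itself). */
size_t utf8_incomplete_tail(const unsigned char *buf, size_t len)
{
    size_t back;

    assert(buf != NULL || len == 0);

    /* Walk backwards over continuation bytes to find the lead byte. */
    for (back = 1; back <= len && back <= 4; back++) {
        unsigned char c = buf[len - back];
        size_t need;

        if ((c & 0xC0) == 0x80)
            continue;           /* continuation byte, keep looking */
        need = utf8_seq_len(c);
        return (need > back) ? back : 0;
    }
    return 0;
}
```

A reader loop would process only the first len - utf8_incomplete_tail(buf, len)
bytes and move the held-back tail to the front of the next buffer before
refilling it.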

Werner

-- 
  "Having a smoking section in a restaurant is like having
          a peeing section in a swimming pool." -- Edward Burr