Re: [ast-developers] Hang when counting invalid character byte sequence in GB18030...

Glenn Fowler Sat, 07 Sep 2013 07:49:40 -0700

found the solaris iconv problem in 5 min after sleeping on it
the following command sequences use native commands -- no ast involved


# u.dat is a UTF-32LE file containing <upper-case-u-umlaut><newline> (U+00DC U+000A) #

$ od -tx1 u.dat
0000000 dc 00 00 00 0a 00 00 00
0000010

# on linux.i386-64
$ /usr/bin/iconv -f UTF-32LE -t US-ASCII < u.dat
/usr/bin/iconv: illegal input sequence at position 0
$ echo $?
1

# on sol11.i386
$ /bin/iconv -f UTF-32LE -t US-ASCII < u.dat
?
$ echo $?
0

solaris is *bad* in at least 3 ways
* it apparently detects a conversion error but does not issue a diagnostic
* it apparently detects a conversion error and substitutes '?' for "bad" bytes
* it apparently detects a conversion error but exits 0
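
the two behaviors above can be reproduced without either platform's iconv.
a minimal python sketch, assuming python's built-in codecs as a stand-in:
the "strict" error policy fails the way linux does, the "replace" policy
substitutes '?' the way solaris appears to. convert() is a hypothetical
helper for illustration, not what either iconv does internally.

```python
# u.dat contents from the od dump above: U+00DC then newline, as UTF-32LE
data = bytes([0xDC, 0x00, 0x00, 0x00, 0x0A, 0x00, 0x00, 0x00])


def convert(raw, errors):
    """Decode UTF-32LE, then encode to US-ASCII with the given error policy."""
    return raw.decode("utf-32-le").encode("ascii", errors=errors)


# linux-like: the unconvertible character is reported as an error
try:
    convert(data, "strict")
    strict_failed = False
    bad_position = None
except UnicodeEncodeError as exc:
    strict_failed = True
    bad_position = exc.start  # index of the offending character

# solaris-like: the unconvertible character is silently replaced
lossy = convert(data, "replace")

print(strict_failed, bad_position, lossy)
```

a caller that only looks at the lossy result cannot tell a substituted '?'
from a real one, which is why the missing diagnostic and the 0 exit status
matter.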

who knows what liberties other implementations may take

I wonder if ast, in the C/POSIX locale and MB_CUR_MAX==1, should have
strict and non-strict conformance modes

strict: US-ASCII: characters are 7 bit bytes, bytes with bit 0x80 set are invalid
non-strict: ISO-8859-1: characters are 8 bit bytes

non-strict would match linux C locale behavior
strict would match whose behavior?
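
to make the two modes concrete, here is a minimal python sketch of the
single-byte validity check each one implies. invalid_offsets() and its
strict flag are hypothetical names for illustration only, not ast api.

```python
def invalid_offsets(buf, strict):
    """Return offsets of bytes that are not valid single-byte characters."""
    if not strict:
        # non-strict (ISO-8859-1): every 8 bit byte is a character
        return []
    # strict (US-ASCII): any byte with bit 0x80 set is invalid
    return [i for i, b in enumerate(buf) if b & 0x80]


sample = b"u\xfc\n"  # 'u', a byte with bit 0x80 set, newline

print(invalid_offsets(sample, strict=True))
print(invalid_offsets(sample, strict=False))
```

under this sketch the same input is clean in non-strict mode and has one
invalid byte in strict mode, so a utility that counts characters would
give different answers depending on the mode.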

I believe posix gives wiggle room here for the C locale to have chars with
bit 0x80 set
ast in non-strict mode will simply apply that wiggle room consistently across
all of its os/arch implementations

I guess what I'm really saying is that ast *will* be consistent across all 
implementations

the question then is: in the C locale is the ast behavior always strict or is
it tempered by astconf("CONFORMANCE")?

