[BUG] Improper 8-bit parsing because of signed overflow

max ulidtko Mon, 17 Jan 2011 07:00:58 -0800

In parser.c there is a function readtoken1() which fails to properly
parse some 8-bit (i.e. UTF-8) tokens (filenames). Consider the following
test:


$ cat < тест
sh: cannot open те�т: No such file
$ echo "тест" | od -b
0000000 321 202 320 265 321 201 321 202 012
0000011

Here "тест" is four Cyrillic characters which get encoded in 8 bytes of
UTF-8. The third character (sixth byte, \201, to be exact) fails to be
parsed by dash.
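
The byte layout above can be reproduced with a small sketch. The names `TEST_NAME` and `octal_dump` are made up for illustration and are not part of dash. The file name is spelled with octal escapes so the result does not depend on the encoding of the source file.

```c
#include <stdio.h>
#include <string.h>

/* "тест" as UTF-8, written as octal escapes (8 bytes plus the NUL). */
static const char TEST_NAME[] = "\321\202\320\265\321\201\321\202";

/* Format the bytes of s as space-separated 3-digit octal values, in the
 * style of `od -b`. Returns the number of bytes formatted. To hold the
 * whole string, out needs at least 4 * strlen(s) + 1 bytes. */
static size_t octal_dump(const char *s, char *out, size_t outsize)
{
    size_t n = strlen(s);
    size_t used = 0;
    size_t i;

    if (outsize > 0)
        out[0] = '\0';
    /* Each step writes at most 4 characters plus the terminating NUL. */
    for (i = 0; i < n && used + 5 <= outsize; i++) {
        /* Go through unsigned char: a plain char may be negative here. */
        used += (size_t)snprintf(out + used, outsize - used, "%s%03o",
                                 i ? " " : "",
                                 (unsigned)(unsigned char)s[i]);
    }
    return i;
}
```

Calling `octal_dump(TEST_NAME, buf, sizeof buf)` with a 64-byte `buf` formats all 8 bytes; the sixth one, `TEST_NAME[5]`, is the \201 byte discussed above.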

The reason is signed overflow. The parser uses syntax tables to
determine the class to which a given byte (assuming it's a whole
character) belongs. The lookup is done like this:
        switch(syntax[c]) {
But the variable c is declared as int and, as the gdb output below
shows, it holds the byte as a signed value. So instead of looking up
the character \201 (129 in decimal), the parser uses the signed index
-127 and reads garbage which happens to be not equal to 0 (== CWORD).
As a result, the output token becomes corrupted.
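
The arithmetic behind the -127 can be sketched as follows. This is only an illustration under stated assumptions: `DEMO_CWORD`, `widen_as_signed_char`, `widen_as_unsigned_char` and `demo_lookup` are hypothetical names, the table is a plain 256-entry array indexed from 0, and dash's real syntax tables may be laid out differently.

```c
/* Hypothetical class code, mirroring the CWORD == 0 mentioned above. */
enum { DEMO_CWORD = 0 };

/* Value an int receives from a byte stored in a signed 8-bit char
 * (two's complement): bytes 0200..0377 become -128..-1. */
static int widen_as_signed_char(unsigned char byte)
{
    return byte >= 128 ? (int)byte - 256 : (int)byte;
}

/* Value an int receives when the byte is treated as unsigned char. */
static int widen_as_unsigned_char(unsigned char byte)
{
    return (int)byte;
}

/* Bounds-checked lookup in a 256-entry table indexed from 0. Returns -1
 * for an out-of-range index, where a bare table[index] would read
 * memory outside the array. */
static int demo_lookup(const unsigned char table[256], int index)
{
    if (index < 0 || index > 255)
        return -1;
    return table[index];
}
```

For the byte \201, `widen_as_signed_char` gives -127 and `widen_as_unsigned_char` gives 129. Under the assumptions above, only the second value is a valid index into the table.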

Here is some gdb output:
(gdb) next
884                             switch(syntax[c]) {
8: syntax[c] = 12 '\f'
7: out = 0x8061659 ""
6: stacknxt = 0x8061654 "те", <incomplete sequence \321>
5: (char)c = -127 '\201'
(gdb) print syntax[129]
$42 = 0 '\000'
(gdb) print syntax[(unsigned char)c]
$43 = 0 '\000'
(gdb) print syntax[c]
$44 = 12 '\f'

I would note that *every* 8-bit character is looked up in the syntax
tables incorrectly. Although only some cases lead to user-visible
breakage, this is definitely a bug which needs to be fixed.

However, because the code is written in a very low-level style, I was
unable to figure out a working fix. My first attempt, changing the
lookup to syntax[(unsigned char)c], didn't work.


------
Regards,
max ulidtko
