2018-07-22 6:43 GMT+08:00 Bruno Haible <br...@clisp.org>:
> Hi Pádraig,
>
>> I've attached a gnulib patch to document for iscntrl at least.
>
>> +This function does not support arguments outside of the range of the
>> +unsigned char type in locales with large character sets, on some platforms.
>> +OS X 10.5 will return non zero for characters >= 0x80 in UTF-8 locales.
>
> In UTF-8 locales, arguments >= 0x80 are invalid arguments for iscntrl().
>
> POSIX [1] says
> "The c argument is a type int, the value of which the application shall
> ensure is a character representable as an unsigned char or equal to the
> value of the macro EOF. If the argument has any other value, the behavior
> is undefined."
>
> The term "character" is defined here [2]:
> "A sequence of one or more bytes representing a single graphic symbol or
> control code."
>
> So, in a UTF-8 locale, a "character representable as an unsigned char"
> is a byte sequence of length 1, where the single byte has a value in the
> range 0x00..0x7F.
>
> For invalid values "the behavior is undefined." You were expecting a value 0.
>
> Now, in the gnulib documentations, what we mention as portability problems
> are the cases where
> - the behaviour for valid arguments is different on different platforms, or
> - the boundary between valid and invalid arguments is fuzzy and depends on
> the platform.
> IMO there's no point in documenting that a function _really_ has undefined
> behaviour when POSIX says that it has undefined behaviour.
>
>> I've also attached an alternative patch for df (in your name).
>
> This patch is correct (because the characters that you test for in c_iscntrl
> are 0x00..0x1F, 0x7F, which don't occur as second or later byte in a multibyte
> character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS encodings).
>
> But it does not catch control characters outside of the ASCII range. It would
> make sense to catch these as well. If you want to do that,
> 'hide_problematic_chars' needs to be rewritten as a loop that iterates across
> the multibyte characters. For example with the 'mbiter' module, in
> combination with the mb_iscntrl function from the 'mbchar' module. Or
> directly with mbrtowc() and iswcntrl().
>
> Bruno
>
> [1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/iscntrl.html
> [2] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_87
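
For reference, the mbrtowc()/iswcntrl() loop suggested above might look roughly like the sketch below. This is not the actual 'hide_problematic_chars' from df: the name `replace_control_chars` is hypothetical, and replacing invalid or incomplete byte sequences with '?' is an assumption of this sketch rather than something the thread specifies.

```c
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper (not the coreutils function): replace each control
   character in the NUL-terminated multibyte string S with a single '?',
   in place.  Invalid or incomplete byte sequences are also replaced,
   one byte at a time; that policy is an assumption of this sketch.  */
static void
replace_control_chars (char *s)
{
  mbstate_t state;
  char *src = s;                /* next byte to read */
  char *dst = s;                /* next byte to write; never passes src */
  size_t remaining = strlen (s);

  memset (&state, 0, sizeof state);

  while (remaining > 0)
    {
      wchar_t wc;
      size_t n = mbrtowc (&wc, src, remaining, &state);

      if (n == (size_t) -1 || n == (size_t) -2)
        {
          /* Invalid or truncated sequence: emit '?', skip one byte,
             and reset the conversion state to resynchronize.  */
          *dst++ = '?';
          src++;
          remaining--;
          memset (&state, 0, sizeof state);
          continue;
        }

      if (n == 0)
        n = 1;                  /* defensive; strlen() excludes the NUL */

      if (iswcntrl ((wint_t) wc))
        *dst++ = '?';           /* one '?' per control character */
      else
        {
          memmove (dst, src, n);        /* copy the character unchanged */
          dst += n;
        }

      src += n;
      remaining -= n;
    }

  *dst = '\0';
}
```

The caller would need to have called setlocale (LC_ALL, "") beforehand for the multibyte decoding to follow the user's locale; without it, the loop runs in the "C" locale and only ASCII behaviour is predictable.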

The `c_iscntrl()` patch also fixes the issue on macOS. Please tell me if
you want me to test other patches, thanks!

Cheers,
Chih-Hsuan Yen