On 23/02/2022 17:55, Pádraig Brady wrote:
> I think isspace(x85) returning true on macOS is a bug,
Bug is a bit of a strong word here.
A digression into why 0x85 is being treated specially.

Note Cyrillic kha "х" is encoded in UTF-8 as:

  $ printf '\u0445' | od -tx1
  0000000 d1 85

What I think is happening is that \u0085 represents "Next Line" in Unicode.
This is present in Unicode to support mapping to/from the corresponding
char in EBCDIC, which had a distinct char for this in addition to CR and LF.
Given isspace('\n') returns true, it makes some sense that
isspace("Next Line") would also return true, and I guess that through
implementation details isspace(int) is operating on UTF-32 values on macOS
in UTF-8 locales, and is thus returning true for this value.

BTW 0xA0 is the only other value that isspace() returns true for
(other than the standard c_isspace() values of course).
This is non-breaking space, so it's best we don't split on it anyway.
I.e. this is another benefit to the change.

I still think using c_isspace() to avoid this issue is best,
and intend to push the change tomorrow.

cheers,
Pádraig
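
P.S. To make the difference concrete, here is a minimal sketch of splitting a byte string into fields with a pluggable separator test. Both ascii_isspace() and count_fields() are hypothetical names made up for illustration; this is neither gnulib's c_isspace() implementation nor the actual change, just the same idea under the assumption that only the six "C" locale whitespace bytes should separate fields.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a locale-independent test like c_isspace():
   nonzero only for space, \t, \n, \v, \f and \r (assumes ASCII, where
   \t..\r are the contiguous values 9..13).  */
int
ascii_isspace (int c)
{
  return c == ' ' || (c >= '\t' && c <= '\r');
}

/* Count the fields in S, where IS_SEP decides which bytes separate them.
   Each byte is passed as an unsigned char value, as <ctype.h> requires.  */
size_t
count_fields (const char *s, int (*is_sep) (int))
{
  size_t n = 0;
  int in_field = 0;

  assert (s != NULL);
  for (; *s; s++)
    {
      int sep = is_sep ((unsigned char) *s) != 0;
      if (!sep && !in_field)
        n++;                    /* first byte of a new field */
      in_field = !sep;
    }
  return n;
}
```

With ascii_isspace as the separator test, the two bytes d1 85 of "х" always stay in one field. Passing isspace from <ctype.h> instead, after setlocale (LC_ALL, ""), makes the result depend on the platform and locale, which is the behaviour discussed above.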