Hello Eric,

On 08/29/2016 01:13 PM, Eric Blake wrote:
> On 08/27/2016 12:05 AM, Assaf Gordon wrote:
>> Regarding wchar_t == UCS:
> But not in Cygwin, where wchar_t is 2 bytes, and where Cygwin already
> supports surrogate pairs in wchar_t to represent Unicode characters
> beyond 0xffff

Thank you for mentioning this.
On 32-bit AIX wchar_t is also 2 bytes, but I'm not sure whether it is
UCS-2 or just BMP.

I can think of a few options:

1. Process entire lines, keep them in memory as multibyte strings in the
   current locale, then use gnulib's unicode-normalization functions
   that take an entire string (e.g. u8_normalize).
   (This was the initial implementation, in
   http://lists.gnu.org/archive/html/coreutils/2016-07/msg00018.html ).

2. Detect such systems (where wchar_t == UCS-2 or BMP) at runtime or at
   configuration time, and then either:
   2.1: issue a warning if the input is beyond the BMP
        (meaning partial unicode normalization support on such systems)
   2.2: add additional code to convert UCS-2 surrogate pairs into UCS-4

3. Decide not to support unicode normalization on such systems
   (beyond what 'just works' with BMP characters).

Comments welcome,
 - assaf
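
For reference, the detection in option 2 and the conversion in option 2.2
could be sketched roughly as below. This is only an illustration, not code
from coreutils or gnulib: the names wchar_is_16bit and utf16_decode are
hypothetical, and the sketch assumes the 16-bit units follow the standard
surrogate-pair encoding (high surrogate 0xD800..0xDBFF followed by low
surrogate 0xDC00..0xDFFF).

```c
#include <stddef.h>
#include <stdint.h>   /* uint16_t, uint32_t, WCHAR_MAX */

/* Hypothetical helper: true on systems where wchar_t cannot hold
   code points beyond the BMP (e.g. a 2-byte wchar_t). */
int wchar_is_16bit(void)
{
  return WCHAR_MAX <= 0xFFFF;
}

/* Hypothetical helper: decode one code point from a buffer of
   N 16-bit units starting at S.  Stores the UCS-4 value in *OUT and
   returns the number of units consumed (1 or 2), or 0 if the buffer
   is empty or starts with an unpaired surrogate. */
size_t utf16_decode(const uint16_t *s, size_t n, uint32_t *out)
{
  if (n == 0)
    return 0;

  uint32_t hi = s[0];

  if (hi >= 0xD800 && hi <= 0xDBFF)
    {
      /* High surrogate: must be followed by a low surrogate. */
      if (n < 2 || s[1] < 0xDC00 || s[1] > 0xDFFF)
        return 0;
      uint32_t lo = s[1];
      *out = 0x10000 + (((hi - 0xD800) << 10) | (lo - 0xDC00));
      return 2;
    }

  if (hi >= 0xDC00 && hi <= 0xDFFF)
    return 0;   /* Low surrogate without a preceding high surrogate. */

  *out = hi;    /* Plain BMP character. */
  return 1;
}
```

A caller would loop over the buffer, advancing by the returned count, and
treat a return value of 0 as the point at which to warn (option 2.1) or
pass the unit through unchanged.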
