Hello,

I think that we've handled the low-hanging fruit (e.g. expand/cut/fold) when it comes to multibyte support in coreutils. The remaining programs (e.g. sort, join, uniq, tr, od) present some challenges, both in terms of what the 'correct' (and useful) behavior is, and in terms of implementation.

I also think a common thread is the combination of these three requirements:

1. Invalid sequences must be handled as single bytes
2. Can't rely on native wchar_t (e.g. for cygwin) without extra work
3. Can't assume UTF-8 (or even Unicode)

Each requirement by itself is not too problematic, but combined they make a portable and efficient implementation quite cumbersome.

I'd like to ask a heretical question: what if we can relax these requirements? Specifically, what if we can agree that on systems where wchar_t is not sufficient, we only support UTF-8 (and thus use gnulib's internal fast implementations)? (I would love to suggest supporting only UTF-8 everywhere, but I'm sure this would not be accepted...)

I will continue to work on multibyte support in any case, but I think it will make things much better if we are not tied down by these (legacy?) issues.

With a bit of hand-waving, wouldn't it be reasonable to say that the largest portion of GNU coreutils users have systems that both have a usable wchar_t *and* work primarily in UTF-8?

At the risk of mixing apples and oranges, checking the encodings used by websites shows that UTF-8 has clearly come to dominate over time:
https://w3techs.com/technologies/details/en-utf8/all/all
http://pinyin.info/news/2015/utf-8-unicode-vs-other-encodings-over-time/

I know coreutils is not meant for the web, but I hope this hints that UTF-8 is gaining popularity beyond websites as well.

Looking at other implementations, some chose to switch to UTF-8 completely (e.g. OpenBSD-6, or Linux with musl-libc). Others have a usable wchar_t and have supported multibyte processing for a long time (e.g. FreeBSD, Mac OS X).

I have skimmed through past mailing-list discussions, and Eric has been replying since about 2006 saying essentially "if someone comes up with an efficient implementation we'll add it", but despite many attempts we still don't have it.

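To make requirement 1 concrete under the UTF-8-only assumption, here is a minimal sketch of a decoder that never fails: a well-formed sequence yields one code point, and anything else (stray continuation byte, truncated or overlong sequence, surrogate, value above U+10FFFF) is consumed as exactly one byte. The names `u8_unit`, `u8_next` and `u8_count` are made up for illustration; this is not gnulib's API and not code from coreutils.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: one decoded unit of input. */
struct u8_unit {
    uint32_t value;  /* code point, or the raw byte value if !valid */
    size_t   len;    /* bytes consumed: 1..4 if valid, always 1 if not */
    bool     valid;
};

/* Decode one unit from s[0..n-1] (n > 0), assuming the input is meant
   to be UTF-8.  An invalid sequence is reported as a single byte. */
static struct u8_unit u8_next(const unsigned char *s, size_t n)
{
    assert(s != NULL && n > 0);
    struct u8_unit bad = { s[0], 1, false };
    uint32_t cp, min;
    size_t len;

    if (s[0] < 0x80)
        return (struct u8_unit){ s[0], 1, true };          /* ASCII */
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; min = 0x10000; }
    else
        return bad;              /* stray continuation byte or 0xF8..0xFF */

    if (len > n)
        return bad;              /* sequence truncated by end of buffer */

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return bad;          /* expected a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return bad;

    return (struct u8_unit){ cp, len, true };
}

/* Count "characters" the way a tool like wc -m or cut -c might:
   every valid code point and every invalid byte counts as one. */
static size_t u8_count(const unsigned char *s, size_t n)
{
    size_t count = 0;
    while (n > 0) {
        struct u8_unit u = u8_next(s, n);
        s += u.len;
        n -= u.len;
        count++;
    }
    return count;
}
```

Since every call consumes at least one byte, a loop over the buffer always terminates and no input is ever rejected, which is the property the byte-oriented tools need. Supporting arbitrary locale encodings would mean going through mbrtowc() and wchar_t instead, and that is where requirements 2 and 3 start to hurt.
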
It won't be a regression for those few limited systems, because coreutils currently doesn't provide any multibyte support.

Lastly, I've arranged my notes into a web page. I hope these notes will save some time for others interested in catching up on the multibyte issue (except for the time it'll take to read my notes (-: ):
http://crashcourse.housegordon.org/coreutils-multibyte-support.html

I'm happy to hear comments and feedback.

Thanks for reading so far,
 - assaf
