Hello,

Attached is a second attempt at adding multibyte support for coreutils (continued from http://lists.gnu.org/archive/html/coreutils/2016-07/msg00013.html). Of course, this is just a rough draft and a basis for discussion, not final in any way.

It includes four commits:

1. New module "multibyte": system-dependent definitions and multibyte detection code, based on code that is repeated throughout the current i18n patch. Also includes a "src/multibyte-test" program to test the detection code (of course, not a final location for this executable).

2. New module "mbbuffer": provides a convenient interface for reading multibyte input data (from either fread(3) or read(2)) using a fixed-size buffer, calling mbrtowc and handling all cases. It also takes care of counting lines and column positions. Includes a test program, "src/mbbuffer-test", to test the buffering code.

3. The "unorm" program (as previously discussed) now uses the "mbbuffer" module, and the code is smaller and cleaner. It still assumes wchar_t == UCS4, but see the details below regarding why I think that's an acceptable assumption.

4. As a proof of concept, 'expand' with initial multibyte support, with the multibyte code being very similar to the single-byte code. Currently, zero-width glyphs and combining characters are not handled.

====

Regarding wchar_t == UCS:

1. 'unorm' only uses the wchar_t value directly if unicode normalization is requested (otherwise, it prints the multibyte octets as-is).

2. If normalization is requested, I think it's safe to assume the locale is unicode-related (e.g. *unicode* normalization under iso88591/shift-jis/Big5/eucJP locales is not meaningful).

3. For now, I'm assuming unicode-supporting locales are de facto UTF-8, but I suspect this can be relaxed if needed.

And so, the question becomes: when the locale is "UTF-8", is the internal representation of 'wchar_t' identical to UCS2 or UCS4 (i.e. unicode code points)?

While the standard explicitly says this cannot be assumed, I think in practice it is always the case. It is so in glibc and musl-libc, and in OpenBSD, FreeBSD and NetBSD with "UTF-8" locales (but not in non-UTF-8 locales). In OpenSolaris with unicode locales, wchar_t is UTF-32 (https://docs.oracle.com/cd/E36784_01/html/E39536/gmwkm.html).
For AIX, wchar_t is either UCS2 or UCS4 in unicode locales (for 32-bit/64-bit binaries respectively, see https://www.ibm.com/support/knowledgecenter/en/ssw_aix_53/com.ibm.aix.nls/doc/nlsgdrf/codeset_over.htm).

I'd be very interested to learn about more systems, but I hope this non-standardized behavior is prevalent enough to be relied upon. The current implementation of 'unorm' first checks whether wchar_t == UCS4, and only allows unicode normalization if it is.

Comments are very welcome,
 - assaf

Assaf Gordon (4):
  build: multibyte: new module
  build: mbbuffer: new module
  unorm: a new program to fix and normalize multibyte files
  expand: add multibyte support

 AUTHORS                            |   1 +
 README                             |   2 +-
 bootstrap.conf                     |   7 +
 build-aux/gen-lists-of-programs.sh |   1 +
 doc/coreutils.texi                 |  20 +-
 man/.gitignore                     |   1 +
 man/local.mk                       |   1 +
 man/unorm.x                        |   4 +
 po/POTFILES.in                     |   1 +
 scripts/git-hooks/commit-msg       |   2 +-
 src/.gitignore                     |   1 +
 src/expand-common.c                |  16 +-
 src/expand-common.h                |   5 +
 src/expand.c                       | 144 ++++++++++-
 src/local.mk                       |  19 +-
 src/mbbuffer-test.c                | 295 +++++++++++++++++++++
 src/mbbuffer.c                     | 305 ++++++++++++++++++++++
 src/mbbuffer.h                     | 176 +++++++++++++
 src/multibyte-test.c               |  92 +++++++
 src/multibyte.c                    | 153 +++++++++++
 src/multibyte.h                    | 101 ++++++++
 src/unorm.c                        | 512 +++++++++++++++++++++++++++++++++++++
 tests/local.mk                     |   2 +
 tests/misc/expand-multibyte.pl     | 106 ++++++++
 tests/misc/expand.pl               |  32 +++
 tests/misc/unorm.pl                | 178 +++++++++++++
 26 files changed, 2171 insertions(+), 6 deletions(-)
 create mode 100644 man/unorm.x
 create mode 100644 src/mbbuffer-test.c
 create mode 100644 src/mbbuffer.c
 create mode 100644 src/mbbuffer.h
 create mode 100644 src/multibyte-test.c
 create mode 100644 src/multibyte.c
 create mode 100644 src/multibyte.h
 create mode 100644 src/unorm.c
 create mode 100755 tests/misc/expand-multibyte.pl
 create mode 100755 tests/misc/unorm.pl
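
To make the "calling mbrtowc and handling all cases" part of the "mbbuffer" description more concrete, here is a minimal standalone sketch. It is not the code from the attached patch: the names mb_stats and mb_scan are made up for illustration, it works on an in-memory buffer rather than on fread(3)/read(2) input, it counts lines but not columns, and it assumes that NUL is encoded as a single byte (true for UTF-8, not guaranteed for stateful encodings).

```c
/* Illustrative sketch only: NOT the mbbuffer code from the patch.
   The names mb_stats and mb_scan are hypothetical.  */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

struct mb_stats
{
  size_t chars;      /* characters decoded successfully (including NUL) */
  size_t invalid;    /* bytes that did not start a valid sequence */
  size_t lines;      /* number of L'\n' characters seen */
  size_t incomplete; /* bytes of a truncated sequence at the buffer end */
};

/* Decode BUF[0..LEN) in the current locale, one character at a time,
   covering each possible return value of mbrtowc.  */
void
mb_scan (const char *buf, size_t len, struct mb_stats *st)
{
  mbstate_t state;
  size_t pos = 0;

  assert (st != NULL);
  memset (&state, 0, sizeof state);
  memset (st, 0, sizeof *st);

  while (pos < len)
    {
      wchar_t wc;
      size_t n = mbrtowc (&wc, buf + pos, len - pos, &state);

      if (n == (size_t) -1)
        {
          /* Invalid sequence: count one byte, reset the conversion
             state, and try to resynchronize at the next byte.  */
          st->invalid++;
          memset (&state, 0, sizeof state);
          pos++;
        }
      else if (n == (size_t) -2)
        {
          /* Incomplete sequence at the end of the buffer.  A real
             reader would keep these bytes and refill before retrying.  */
          st->incomplete = len - pos;
          break;
        }
      else
        {
          if (n == 0)
            n = 1;  /* NUL decoded; assumed to occupy a single byte */
          st->chars++;
          if (wc == L'\n')
            st->lines++;
          pos += n;
        }
    }
}
```

With plain ASCII input such as "ab\ncd" (5 bytes) this counts 5 characters and 1 line in any locale. For UTF-8 input, the caller would first have to call setlocale (LC_ALL, "") under a UTF-8 locale.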
multibyte-2016-08-27.patch.xz
