Multibyte support (round 2)

Assaf Gordon Fri, 26 Aug 2016 22:06:07 -0700

Hello,

Attached is a second attempt at adding multibyte support for coreutils
(continued from
http://lists.gnu.org/archive/html/coreutils/2016-07/msg00013.html).
Of course this is just a rough draft, a basis for discussion, not final in any
way.


It includes four commits:

1.
New module "multibyte" - system-dependent definitions, and multibyte detection 
code.
Based on code that repeated itself in the current i18n patch.
Also includes "src/multibyte-test" program to test the detection code
(of course not a final location for this executable).

2.
New module "mbbuffer" - provides a convenient interface for reading multibyte
input data (from either fread(3) or read(2)), using a fixed-size buffer,
calling mbrtowc and handling all of its return cases. Also takes care of
counting line/column positions.
Includes test program "src/mbbuffer-test" to test the buffering code.

3.
The "unorm" program (as previously discussed) now uses the "mbbuffer" module,
and the code is smaller and cleaner.
It still assumes wchar_t == UCS4, but see details below regarding
why I think that's an acceptable assumption.

4. 
As a proof-of-concept, 'expand' with initial multibyte support,
with the multibyte code being very similar to the single-byte code.
Currently zero-width glyphs and combining chars are not handled.
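
The mbrtowc handling that item 2 describes can be sketched roughly as
follows. This is not the mbbuffer interface from the patch: 'struct mb_stats'
and 'scan_multibyte' are hypothetical names, the whole input is assumed to
already be in memory, and a real reader would refill its buffer instead of
stopping at a truncated sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical counters, for illustration only.  */
struct mb_stats
{
  size_t chars;       /* successfully decoded characters */
  size_t lines;       /* newline characters seen */
  size_t invalid;     /* octets skipped as invalid */
  size_t incomplete;  /* trailing octets of a truncated sequence */
};

/* Decode LEN octets at BUF in the current locale, covering each
   mbrtowc return case.  Hypothetical helper, not the patch's API.  */
struct mb_stats
scan_multibyte (const char *buf, size_t len)
{
  struct mb_stats st = { 0, 0, 0, 0 };
  mbstate_t state;
  size_t pos = 0;

  assert (buf != NULL || len == 0);
  memset (&state, 0, sizeof state);

  while (pos < len)
    {
      wchar_t wc = 0;
      size_t n = mbrtowc (&wc, buf + pos, len - pos, &state);

      if (n == (size_t) -1)
        {
          /* Invalid sequence: skip one octet and reset the state,
             which is undefined after an encoding error.  */
          st.invalid++;
          memset (&state, 0, sizeof state);
          pos++;
        }
      else if (n == (size_t) -2)
        {
          /* Truncated sequence at the end of the data.  A buffered
             reader would keep these octets and read more input.  */
          st.incomplete = len - pos;
          break;
        }
      else
        {
          /* n == 0 means a NUL character was decoded; this sketch
             assumes it occupied a single octet.  */
          if (n == 0)
            n = 1;
          st.chars++;
          if (wc == L'\n')
            st.lines++;
          pos += n;
        }
    }
  return st;
}
```

A caller would run setlocale (LC_ALL, "") first so that mbrtowc decodes
according to the user's locale; without it the program stays in the "C" locale.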

====

Regarding wchar_t == UCS:
1. 'unorm' only uses the wchar_t value directly if unicode normalization
   is requested (otherwise, it prints the multibyte octets as-is).
  
2. If normalization is requested, I think it's safe to assume the
   locale is unicode-related (e.g. *unicode* normalization under
   iso88591/shift-jis/Big5/eucJP locales is not meaningful).

3. For now, I'm assuming unicode-supporting locales are de-facto UTF-8, 
   but I suspect this can be relaxed if needed.

And so, the question becomes:
When the locale is "UTF-8", is the internal representation of 'wchar_t'
identical to UCS2 or UCS4 (i.e. unicode code-points)?
While the standard explicitly says this cannot be assumed,
I think in practice it is always the case.

It is so in glibc and musl-libc,
and in OpenBSD, FreeBSD and NetBSD with "UTF-8" locales (but not in non-UTF-8
locales).
In OpenSolaris with unicode locales, wchar_t is UTF-32 
(https://docs.oracle.com/cd/E36784_01/html/E39536/gmwkm.html ).
For AIX, wchar_t is either UCS2 or UCS4 in unicode locales (for 32bit/64bit
binaries respectively, see
https://www.ibm.com/support/knowledgecenter/en/ssw_aix_53/com.ibm.aix.nls/doc/nlsgdrf/codeset_over.htm ).

I'd be very interested to learn about more systems, but I hope this
non-standardized behavior is prevalent enough to be relied upon.

The current implementation of 'unorm' first checks if 'wchar_t==UCS4', and only 
allows unicode-normalization if it is.
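
A runtime check of this kind could look roughly like the sketch below: convert
a few known UTF-8 sequences with mbrtowc and compare the resulting wchar_t
values against the expected code points. This is an assumption about how such
a check might be written, not the code in the patch; 'probe_sample' and
'wchar_looks_like_ucs4' are hypothetical names. Standard C also offers a
compile-time hint, the __STDC_ISO_10646__ macro, which an implementation
defines when wchar_t values are ISO 10646 code points.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Return 1 if the LEN octets at MB convert, in the current locale,
   to exactly one wchar_t whose value equals EXPECTED.  */
int
probe_sample (const char *mb, size_t len, unsigned long expected)
{
  mbstate_t state;
  wchar_t wc = 0;
  size_t n;

  assert (mb != NULL && len > 0);
  memset (&state, 0, sizeof state);
  n = mbrtowc (&wc, mb, len, &state);
  return n == len && (unsigned long) wc == expected;
}

/* Return 1 if wchar_t appears to hold UCS4 code points.  Assumes the
   caller has already switched to a UTF-8 locale; in any other locale
   the multi-octet probes fail and the result is 0.  */
int
wchar_looks_like_ucs4 (void)
{
  if (sizeof (wchar_t) < 4)
    return 0;                   /* too narrow for code points above U+FFFF */

  return (probe_sample ("A", 1, 0x41UL)                        /* U+0041 */
          && probe_sample ("\xC3\xA9", 2, 0xE9UL)              /* U+00E9 */
          && probe_sample ("\xE2\x82\xAC", 3, 0x20ACUL)        /* U+20AC */
          && probe_sample ("\xF0\x9F\x98\x80", 4, 0x1F600UL)); /* U+1F600 */
}
```

The probe only samples four code points, so a positive result is evidence
rather than proof; a program could refuse normalization whenever it returns 0.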

Comments are very welcome,
 - assaf




Assaf Gordon (4):
  build: multibyte: new module
  build: mbbuffer: new module
  unorm: a new program to fix and normalize multibyte files
  expand: add multibyte support

 AUTHORS                            |   1 +
 README                             |   2 +-
 bootstrap.conf                     |   7 +
 build-aux/gen-lists-of-programs.sh |   1 +
 doc/coreutils.texi                 |  20 +-
 man/.gitignore                     |   1 +
 man/local.mk                       |   1 +
 man/unorm.x                        |   4 +
 po/POTFILES.in                     |   1 +
 scripts/git-hooks/commit-msg       |   2 +-
 src/.gitignore                     |   1 +
 src/expand-common.c                |  16 +-
 src/expand-common.h                |   5 +
 src/expand.c                       | 144 ++++++++++-
 src/local.mk                       |  19 +-
 src/mbbuffer-test.c                | 295 +++++++++++++++++++++
 src/mbbuffer.c                     | 305 ++++++++++++++++++++++
 src/mbbuffer.h                     | 176 +++++++++++++
 src/multibyte-test.c               |  92 +++++++
 src/multibyte.c                    | 153 +++++++++++
 src/multibyte.h                    | 101 ++++++++
 src/unorm.c                        | 512 +++++++++++++++++++++++++++++++++++++
 tests/local.mk                     |   2 +
 tests/misc/expand-multibyte.pl     | 106 ++++++++
 tests/misc/expand.pl               |  32 +++
 tests/misc/unorm.pl                | 178 +++++++++++++
 26 files changed, 2171 insertions(+), 6 deletions(-)
 create mode 100644 man/unorm.x
 create mode 100644 src/mbbuffer-test.c
 create mode 100644 src/mbbuffer.c
 create mode 100644 src/mbbuffer.h
 create mode 100644 src/multibyte-test.c
 create mode 100644 src/multibyte.c
 create mode 100644 src/multibyte.h
 create mode 100644 src/unorm.c
 create mode 100755 tests/misc/expand-multibyte.pl
 create mode 100755 tests/misc/unorm.pl

multibyte-2016-08-27.patch.xz
Description: Binary data
