Re: [Implemented] [coreutils] Partial UTF-8 support for "cut -c"

Assaf Gordon Mon, 12 Aug 2019 15:03:44 -0700

Hello,

On Mon, Aug 12, 2019 at 09:19:54PM +0200, [email protected] wrote:
> I have partially implemented the option "-c" ("--characters") for UTF-8
> non-ASCII characters[...]


First and foremost, thank you for taking the time and effort to develop
new features and send them to the mailing list.

> This implementation has two, somewhat important shortcomings:
>
> * Other encodings are not implemented.
> [...] I decided to stick with just UTF-8.

At this point in time, this limitation is a show-stopper.
A multibyte-aware implementation for GNU coreutils (for all programs,
not just for 'cut') should support all native encodings.

Ostensibly, this should be implemented using the standard
mbrtowc(3)/mbstowcs(3) family of functions - but in reality there is
another complication: a good implementation should also support
systems where 'wchar_t' is limited to 16 bits (instead of 32 bits),
and therefore requires handling of Unicode surrogate pairs.
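
To make the above concrete, here is a minimal sketch (not coreutils code)
of counting characters with mbrtowc(3) in the current locale, plus the
arithmetic for joining a UTF-16 surrogate pair. The function names are
made up for this example, and counting an invalid sequence as one
character per byte is only an assumption of the sketch.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count characters in buf[0..len) using the current locale's encoding.
   Assumption of this sketch: an invalid or truncated sequence counts as
   one character per byte. */
size_t count_locale_chars(const char *buf, size_t len)
{
    mbstate_t st;
    size_t count = 0;
    size_t i = 0;

    memset(&st, 0, sizeof st);
    while (i < len) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, buf + i, len - i, &st);

        if (n == (size_t)-1 || n == (size_t)-2) {
            /* Invalid or incomplete sequence: reset and skip one byte. */
            memset(&st, 0, sizeof st);
            n = 1;
        } else if (n == 0) {
            /* A NUL character was decoded; it occupies one byte. */
            n = 1;
        }
        i += n;
        count++;
    }
    return count;
}

/* With a 16-bit wchar_t, a code point above U+FFFF cannot fit in one
   unit and is represented as a high/low surrogate pair. */
int is_high_surrogate(unsigned long u) { return u >= 0xD800UL && u <= 0xDBFFUL; }
int is_low_surrogate(unsigned long u)  { return u >= 0xDC00UL && u <= 0xDFFFUL; }

/* Join a valid surrogate pair into the code point it encodes. */
unsigned long combine_surrogates(unsigned long hi, unsigned long lo)
{
    return 0x10000UL + ((hi - 0xD800UL) << 10) + (lo - 0xDC00UL);
}
```

A caller would normally run setlocale(LC_ALL, "") first so that mbrtowc(3)
decodes the user's native encoding; without it, the "C" locale applies.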

You can read more about the problems (and past suggested solutions) here:
https://crashcourse.housegordon.org/coreutils-multibyte-support.html

(as a side note to other readers: if these are no longer show-stopper
requirements, please chime in - this will make things much
easier.)

> * Modifier characters are treated as individual characters [...]
> Decisively, many languages from Western Europe (Spanish,
> Portuguese...) might or might not work with this program, depending on
> which kind of accented letters are produced [...]

I see two related but separate issues here.

The first is generally called "Unicode normalization": if the user
sees the letter "A" with acute accent, is it encoded as one
Unicode character (U+00C1, "Latin Capital Letter A with Acute")
or two Unicode characters ("A" followed by U+0301 "Combining Acute
Accent")?

This issue is not a problem (in the sense that it's OK if cut treats
"A" followed by U+0301 as separate characters) - because we will also
include an additional program that can convert from one form to the
other (called "unorm" in the URL mentioned above).
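
As an illustration of the two forms (a sketch only; the names below are
invented for this example), counting UTF-8 code points by skipping
continuation bytes gives a different answer for each encoding of the
same visible letter:

```c
#include <stddef.h>

/* Count code points in well-formed UTF-8 by counting every byte that is
   not a continuation byte (10xxxxxx). No validation is performed. */
size_t utf8_count_codepoints(const unsigned char *s, size_t len)
{
    size_t count = 0;

    for (size_t i = 0; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)
            count++;
    return count;
}

/* U+00C1 LATIN CAPITAL LETTER A WITH ACUTE, precomposed. */
const unsigned char a_acute_composed[] = { 0xC3, 0x81 };

/* "A" (U+0041) followed by U+0301 COMBINING ACUTE ACCENT. */
const unsigned char a_acute_decomposed[] = { 0x41, 0xCC, 0x81 };
```

The composed form counts as one character and the decomposed form as
two, so a character-based cut would split them differently unless the
input is normalized first.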

The second interesting issue is the (new?) modifiers such as
U+1F3FB "EMOJI MODIFIER FITZPATRICK TYPE-1-2"
(http://unicode.org/reports/tr51/#Diversity
https://codepoints.net/U+1F3FB)
that affect other characters.
Here I don't see an easy way to know if characters should be grouped,
and they should probably be treated as separate characters in all cases.
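
To show why grouping is not easy, here is a deliberately naive sketch
(hypothetical names, not a proposal) that folds any emoji modifier into
the code point before it. It gives the expected answer for an emoji
followed by a modifier, but it also wrongly groups a modifier that
follows a plain letter:

```c
#include <stddef.h>

/* The five emoji modifiers occupy U+1F3FB..U+1F3FF. */
int is_emoji_modifier(unsigned long cp)
{
    return cp >= 0x1F3FBUL && cp <= 0x1F3FFUL;
}

/* Naive grouping: a modifier never starts a new character unless it is
   the first code point. It does not check whether the preceding code
   point can actually take a modifier. */
size_t count_grouped_naive(const unsigned long *cps, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        if (i == 0 || !is_emoji_modifier(cps[i]))
            count++;
    return count;
}

/* U+1F44D THUMBS UP SIGN followed by a modifier: one visible glyph. */
const unsigned long thumbs_up_modified[] = { 0x1F44DUL, 0x1F3FBUL };

/* "A" followed by a modifier: normally two visible glyphs. */
const unsigned long letter_then_modifier[] = { 0x0041UL, 0x1F3FBUL };
```

Both sequences come out as one "character" here, and only the first is
right. Getting it right would need Unicode property data about which
code points accept a modifier, which is why treating each code point
separately is the simpler rule.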


> On the other hand, missing bytes in a multibyte UTF-8 characters are 
> correctly handled
[...]
> It is my hope that you should find this first approach to the problem 
> sufficient for most uses, and incorporate it into the mainstream code.

I would say that your approach of dealing only with UTF-8 has some merits
(i.e., as a "fast path" in parallel to the slower mbrtowc(3) path
and the faster unibyte path).
I suspect that if we do go down that road, it'll be better to use
gnulib's already implemented UTF-8 code (and also UTF-16/UTF-32) instead
of adding ad-hoc functions.
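
For illustration, the three-way split could be chosen roughly as below.
This is only a sketch: the enum, the function name and the exact test
for UTF-8 are assumptions, and codeset names vary between systems
(e.g. "UTF-8" versus "utf8"), so real code would need a more careful
comparison.

```c
#include <stddef.h>
#include <string.h>

enum cut_path {
    PATH_UNIBYTE,    /* one byte per character: fastest */
    PATH_UTF8,       /* dedicated UTF-8 decoder: the "fast path" */
    PATH_MULTIBYTE   /* generic mbrtowc(3) loop: slowest */
};

/* codeset:    the locale's codeset name, e.g. from nl_langinfo(CODESET)
   mb_cur_max: the locale's MB_CUR_MAX value */
enum cut_path choose_path(const char *codeset, size_t mb_cur_max)
{
    if (mb_cur_max <= 1)
        return PATH_UNIBYTE;
    if (strcmp(codeset, "UTF-8") == 0)
        return PATH_UTF8;
    return PATH_MULTIBYTE;
}
```

After setlocale(LC_ALL, ""), a caller would pass nl_langinfo(CODESET)
from <langinfo.h> and MB_CUR_MAX from <stdlib.h>.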

> (Should my modifications be big enough to require it for copyright
> reasons, my name is "Jaime Mosquera", and I obviously agree to the
> terms of the GNU GPL.)

Thank you - that is indeed the gist (copyright assignment is needed from
contributors), but the technicalities are slightly different.

We ask that contributors fill out and send the following form:
https://git.savannah.gnu.org/cgit/gnulib.git/tree/doc/Copyright/request-assign.future
The reasons are explained here: https://www.gnu.org/licenses/why-assign.en.html

regards,
 - assaf
