Re: [Implemented] [coreutils] Partial UTF-8 support for "cut -c"
I will reopen the can of worms of again offering my own multibyte cut
(and other coreutils) if the maintainers ever decide they want it:

https://github.com/ericfischer/coreutils/blob/multibyte-squash/src/cut.c

I think the normalization ambiguity here is resolved by the POSIX
standard's distinction between "characters" and "printing positions."
Combining accents are characters but do not advance the printing
position.

Eric

On Mon, Aug 12, 2019 at 3:03 PM Assaf Gordon wrote:
> Hello,
>
> On Mon, Aug 12, 2019 at 09:19:54PM +0200, jaime.mosqu...@tutanota.com wrote:
> > I have partially implemented the option "-c" ("--characters") for UTF-8
> > non-ASCII characters [...]
>
> First and foremost, thank you for taking the time and effort to develop
> new features and send them to the mailing list.
>
> > This implementation has two somewhat important shortcomings:
> >
> > * Other encodings are not implemented.
> > [...] I decided to stick with just UTF-8.
>
> At this point in time, this limitation is a show-stopper.
> A multibyte-aware implementation for GNU coreutils (for all programs,
> not just for 'cut') should support all native encodings.
>
> Ostensibly, this should be implemented using the standard
> mbrtowc(3)/mbstowcs(3) family of functions - but in reality there is
> another complication: a good implementation should also support systems
> where 'wchar_t' is limited to 16 bits (instead of 32), and therefore
> requires handling of Unicode surrogate pairs.
>
> You can read more about the problems (and past suggested solutions) here:
> https://crashcourse.housegordon.org/coreutils-multibyte-support.html
>
> (As a side note to other readers: if these are no longer show-stopper
> requirements, please chime in - that will make things much easier.)
>
> > * Modifier characters are treated as individual characters [...]
> > Notably, many languages from Western Europe (Spanish,
> > Portuguese...) might or might not work with this program, depending on
> > which kind of accented letters are produced [...]
>
> I see two related but separate issues here.
>
> The first is generally called "Unicode normalization": e.g., if the
> user sees the letter "A" with an acute accent, is it encoded as one
> Unicode character (U+00C1, "Latin Capital Letter A with Acute") or as
> two Unicode characters ("A" followed by U+0301, "Combining Acute
> Accent")?
>
> This issue is not a problem (in the sense that it's OK if cut treats
> "A" followed by U+0301 as separate characters), because we will also
> include an additional program that can convert from one form to the
> other (called "unorm" in the URL mentioned above).
>
> The second interesting issue is the (new?) modifiers, such as U+1F3FB
> "EMOJI MODIFIER FITZPATRICK" (
> http://unicode.org/reports/tr51/#Diversity
> https://codepoints.net/U+1F3FB )
> that affect other characters. Here I don't see an easy way to know
> whether characters should be grouped, and they should probably be
> treated as separate characters in all cases.
>
> > On the other hand, missing bytes in a multibyte UTF-8 character are
> > correctly handled [...]
> > It is my hope that you should find this first approach to the problem
> > sufficient for most uses, and incorporate it into the mainstream code.
>
> I would say that your approach of dealing only with UTF-8 has some
> merits (i.e., as a "fast path" alongside the slower mbrtowc(3) path
> and the faster unibyte path). I suspect that if we do go down that
> road, it'll be better to use gnulib's already-implemented UTF-8 code
> (and also its UTF-16/UTF-32 code) instead of adding ad-hoc functions.
>
> > (Should my modifications be big enough to require it for copyright
> > reasons, my name is "Jaime Mosquera", and I obviously agree to the
> > terms of the GNU GPL.)
>
> Thank you - that is indeed the gist (copyright assignment is needed
> from contributors), but the technicalities are slightly different.
> We ask that contributors fill in and send the following form:
> https://git.savannah.gnu.org/cgit/gnulib.git/tree/doc/Copyright/request-assign.future
> The 'why?' is explained here: https://www.gnu.org/licenses/why-assign.en.html
>
> regards,
>  - assaf
Re: [Implemented] [coreutils] Partial UTF-8 support for "cut -c"
Hello,

On Mon, Aug 12, 2019 at 09:19:54PM +0200, jaime.mosqu...@tutanota.com wrote:
> I have partially implemented the option "-c" ("--characters") for UTF-8
> non-ASCII characters [...]

First and foremost, thank you for taking the time and effort to develop
new features and send them to the mailing list.

> This implementation has two somewhat important shortcomings:
>
> * Other encodings are not implemented.
> [...] I decided to stick with just UTF-8.

At this point in time, this limitation is a show-stopper.
A multibyte-aware implementation for GNU coreutils (for all programs,
not just for 'cut') should support all native encodings.

Ostensibly, this should be implemented using the standard
mbrtowc(3)/mbstowcs(3) family of functions - but in reality there is
another complication: a good implementation should also support systems
where 'wchar_t' is limited to 16 bits (instead of 32), and therefore
requires handling of Unicode surrogate pairs.

You can read more about the problems (and past suggested solutions) here:
https://crashcourse.housegordon.org/coreutils-multibyte-support.html

(As a side note to other readers: if these are no longer show-stopper
requirements, please chime in - that will make things much easier.)

> * Modifier characters are treated as individual characters [...]
> Notably, many languages from Western Europe (Spanish,
> Portuguese...) might or might not work with this program, depending on
> which kind of accented letters are produced [...]

I see two related but separate issues here.

The first is generally called "Unicode normalization": e.g., if the user
sees the letter "A" with an acute accent, is it encoded as one Unicode
character (U+00C1, "Latin Capital Letter A with Acute") or as two
Unicode characters ("A" followed by U+0301, "Combining Acute Accent")?

This issue is not a problem (in the sense that it's OK if cut treats "A"
followed by U+0301 as separate characters), because we will also include
an additional program that can convert from one form to the other
(called "unorm" in the URL mentioned above).

The second interesting issue is the (new?) modifiers, such as U+1F3FB
"EMOJI MODIFIER FITZPATRICK" (
http://unicode.org/reports/tr51/#Diversity
https://codepoints.net/U+1F3FB )
that affect other characters. Here I don't see an easy way to know
whether characters should be grouped, and they should probably be
treated as separate characters in all cases.

> On the other hand, missing bytes in a multibyte UTF-8 character are
> correctly handled [...]
> It is my hope that you should find this first approach to the problem
> sufficient for most uses, and incorporate it into the mainstream code.

I would say that your approach of dealing only with UTF-8 has some
merits (i.e., as a "fast path" alongside the slower mbrtowc(3) path and
the faster unibyte path). I suspect that if we do go down that road,
it'll be better to use gnulib's already-implemented UTF-8 code (and also
its UTF-16/UTF-32 code) instead of adding ad-hoc functions.

> (Should my modifications be big enough to require it for copyright
> reasons, my name is "Jaime Mosquera", and I obviously agree to the
> terms of the GNU GPL.)

Thank you - that is indeed the gist (copyright assignment is needed from
contributors), but the technicalities are slightly different.
We ask that contributors fill in and send the following form:
https://git.savannah.gnu.org/cgit/gnulib.git/tree/doc/Copyright/request-assign.future
The 'why?' is explained here: https://www.gnu.org/licenses/why-assign.en.html

regards,
 - assaf
[Implemented] [coreutils] Partial UTF-8 support for "cut -c"
Good evening.

I have partially implemented the option "-c" ("--characters") for UTF-8
non-ASCII characters, so that using text in any language other than
English does not result in rather subtle bugs ("cut -c 1-79" produces 79
characters, except that lines with one accented letter are one character
short; furthermore, depending on where you cut, you may get "partial",
unprintable characters). My modifications are attached as a patch file
(created through git) against the latest version found on GitHub (as
cloned earlier today).

This implementation has two somewhat important shortcomings:

* Other encodings are not implemented. It should not be too difficult to
  implement UTF-16, and UTF-32 even less so, but branching between them
  would make the code a bit more difficult to understand and would
  require a simple way to detect the current encoding and act
  accordingly. Furthermore, more encodings would be needed (Japan still
  uses non-Unicode encodings with some frequency), so I decided to stick
  with just UTF-8.

* Modifier characters are treated as individual characters, instead of
  being processed along with the characters they modify, as Unicode
  dictates. Notably, many languages from Western Europe (Spanish,
  Portuguese...) might or might not work with this program, depending on
  which kind of accented letters are produced (on my computer it worked
  perfectly).

On the other hand, missing bytes in a multibyte UTF-8 character are
correctly handled (the incomplete character is printed, but the next
character is read whole, without misreading any bytes as part of the
previous character).

It is my hope that you will find this first approach to the problem
sufficient for most uses, and incorporate it into the mainstream code.

Greetings.

(Should my modifications be big enough to require it for copyright
reasons, my name is "Jaime Mosquera", and I obviously agree to the terms
of the GNU GPL.)
diff --git a/src/cut.c b/src/cut.c
index bb2e641f7..8f156ad78 100644
--- a/src/cut.c
+++ b/src/cut.c
@@ -80,6 +80,9 @@ enum operating_mode
     /* Output characters that are in the given bytes. */
     byte_mode,
 
+    /* Output characters that are in the given character positions. */
+    char_mode,
+
     /* Output the given delimiter-separated fields. */
     field_mode
   };
@@ -137,6 +140,40 @@ static struct option const longopts[] =
   {NULL, 0, NULL, 0}
 };
 
+
+static int
+getUTF8 (FILE *stream)
+{
+  int c, ch;
+  int n, i;
+
+  c = getc (stream);
+  if (c == EOF)
+    return c;
+
+  /* Number of continuation bytes expected after this lead byte:
+     110xxxxx -> 1, 1110xxxx -> 2, 11110xxx -> 3, anything else -> 0.  */
+  if ((c >> 5) == 6)
+    n = 1;
+  else if ((c >> 4) == 14)
+    n = 2;
+  else if ((c >> 3) == 30)
+    n = 3;
+  else
+    n = 0;
+
+  for (i = 0; i < n; i++)
+    {
+      ch = getc (stream);
+      if ((ch >> 6) == 2)
+        c = (c << 8) + ch;
+      else
+        {
+          /* Not a continuation byte: the character is incomplete.
+             Push the byte back so the next call starts cleanly.  */
+          ungetc (ch, stream);
+          break;
+        }
+    }
+
+  return c;
+}
+
 void
 usage (int status)
 {
@@ -280,6 +317,71 @@ cut_bytes (FILE *stream)
     }
 }
 
+
+/* Read from stream STREAM, printing to standard output
+   any selected characters.  */
+
+static void
+cut_chars (FILE *stream)
+{
+  uintmax_t char_idx;	/* Number of characters in the line so far. */
+  /* Whether to begin printing delimiters between ranges for the current line.
+     Set after we've begun printing data corresponding to the first range.  */
+  bool print_delimiter;
+
+  char_idx = 0;
+  print_delimiter = false;
+  current_rp = frp;
+  while (true)
+    {
+      int c;		/* Each character from the file. */
+      unsigned int ch;
+      int i;
+      char str[5];
+
+      c = getUTF8 (stream);
+
+      if (c == line_delim)
+        {
+          putchar (c);
+          char_idx = 0;
+          print_delimiter = false;
+          current_rp = frp;
+        }
+      else if (c == EOF)
+        {
+          if (char_idx > 0)
+            putchar (line_delim);
+          break;
+        }
+      else
+        {
+          ch = (unsigned int) c;
+          next_item (&char_idx);
+          if (print_kth (char_idx))
+            {
+              if (output_delimiter_specified)
+                {
+                  if (print_delimiter && is_range_start_index (char_idx))
+                    {
+                      fwrite (output_delimiter_string, sizeof (char),
+                              output_delimiter_length, stdout);
+                    }
+                  print_delimiter = true;
+                }
+
+              /* Unpack the (up to four) bytes of the character,
+                 most significant first, and emit the nonzero ones.  */
+              for (i = 3; i >= 0; i--, ch /= 256)
+                str[i] = ch % 256;
+              str[4] = 0;
+
+              for (i = 0; i < 4; i++)
+                if (str[i] != 0)
+                  putchar ((unsigned char) str[i]);
+            }
+        }
+    }
+}
+
 /* Read from stream STREAM, printing to standard output any selected fields. */
 
 static void
@@ -430,6 +532,8 @@ cut_stream (FILE *stream)
 {
   if (operating_mode == byte_mode)
     cut_bytes (stream);
+  else if (operating_mode == char_mode)
+    cut_chars (stream);
   else
     cut_fields (stream);
 }
@@ -505,7 +609,6 @@ main (int argc, char