On Fri, 15 Mar 2019 08:27:56 +0200 Lauri Tirkkonen <[email protected]> wrote:
Dear Lauri,

> I don't understand your logic. The current solution *is* converting
> everything to a Rune.
>
>  static char *utf8strchr(char *, Rune);
>
> worddelimiters is char *, but utf8strchr() calls utf8decode() on it to
> obtain Runes (to compare to the second argument). While I don't think
> efficiency actually matters a lot here since this is only called when
> you double-click to select something, Jules' solution is quite similar
> to mine in that the worddelimiters string needs no conversion at
> runtime, and therefore more efficient than the current one.

Yes, sorry for that. I noticed after sending that my wording is unclear.
Of course utf8strchr() does an in-situ Rune conversion, but your
solution requires passing a Rune-array to utf8strchr(), implying that
besides converting you would also have to _store_ the Runes somewhere.

> > Now, to clear it up: A Rune literally is only a codepoint and just a
> > typedef for an (at least) 32-bit-integer.
>
> Yes, and yet Rune values are still being passed to wcwidth() in the
> current code. You objected to wchar_t on grounds of portability, but
> already the current code is broken on platforms where wchar_t is less
> than 32 bits, or its values do not match Unicode codepoints. I hope
> you will not suggest replacing wcwidth() with an application-local
> character width table.

wcwidth() is fundamentally broken, given the assumption that
1 codepoint = 1 character (or grapheme if you prefer Unicode-newspeak)
is _wrong_. The discussion on how far we want to support Unicode has
been going on for years and is a difficult call. Standards move very
slowly and I see no way around doing it ourselves one way or another.

The grapheme-cluster-boundary-detection I talked about earlier uses
awk(1) to generate the rules automatically from the machine-readable
unicode-standard-table, converting them to LUTs.

For width-calculation on grapheme clusters, it's more difficult, but not
impossible.
Usually, grapheme clusters are made up of base characters (half or full
width) with modifiers, so something along the lines of [0] with
automatically-generated LUTs would be ideal.

Before the question comes up: ICU should be avoided like the plague,
given it encompasses all locales and is very bloated in nature. There is
a notion of a "common denominator" in Unicode, which is locale
independent, and that's what we should go with.

But please, stop pretending that the standard is in any way even closely
capable of handling Unicode. It isn't and it needs an overhaul. UTF-8 is
a sane default. We can compose codepoints on top of that and then
compose grapheme clusters, for which we can make educated estimations of
their drawing width. Everything else is just a hack and doesn't approach
the problem wholeheartedly.

With best regards

Laslo

[0]:https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

-- 
Laslo Hunhold <[email protected]>
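
A minimal sketch of the width estimate described in the message above:
a grapheme cluster takes the width of its base character, and modifiers
add nothing. It follows the interval-table-plus-binary-search approach
of [0]. Everything named here is hypothetical and not taken from st or
any existing patch: the functions intable() and clusterwidth(), the
struct interval, and above all the two tiny tables, which are
hand-picked excerpts for illustration and would have to be generated
from the Unicode data files in practice. It also assumes the input has
already been split into grapheme clusters of decoded codepoints.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint_least32_t Rune; /* a codepoint, at least 32 bits wide */

struct interval { Rune first, last; };

/* ASSUMPTION: illustrative excerpts only, not complete Unicode data.
 * Each table must be sorted and non-overlapping. */
static const struct interval zerowidth[] = {
	{ 0x0300, 0x036F }, /* combining diacritical marks */
	{ 0x200B, 0x200D }, /* zero width space, ZWNJ, ZWJ */
	{ 0xFE00, 0xFE0F }, /* variation selectors */
};
static const struct interval wide[] = {
	{ 0x1100, 0x115F }, /* Hangul Jamo initial consonants */
	{ 0x4E00, 0x9FFF }, /* CJK unified ideographs */
	{ 0xAC00, 0xD7A3 }, /* Hangul syllables */
};

#define LEN(a) (sizeof(a) / sizeof((a)[0]))

/* binary search for r in a sorted interval table of n entries */
static int
intable(Rune r, const struct interval *t, size_t n)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (r < t[mid].first)
			hi = mid;
		else if (r > t[mid].last)
			lo = mid + 1;
		else
			return 1;
	}
	return 0;
}

/* estimated column width of one grapheme cluster of n codepoints:
 * the width of the first codepoint that is not zero-width, or 0 if
 * the cluster consists of zero-width codepoints only */
static int
clusterwidth(const Rune *cp, size_t n)
{
	size_t i;

	assert(cp != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (intable(cp[i], zerowidth, LEN(zerowidth)))
			continue;
		return intable(cp[i], wide, LEN(wide)) ? 2 : 1;
	}
	return 0;
}
```

Called with the cluster { 0x0065, 0x0301 } ("e" plus combining acute
accent), clusterwidth() yields 1, whereas summing per-codepoint widths
only gives the right answer if the width function knows about every
modifier. This is an estimate, as the message says: clusters whose
rendered width differs from that of their base character are not
handled by this sketch.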
