On Wed, 13 Mar 2019 20:35:09 +0100 Hiltjo Posthuma <[email protected]> wrote:
Dear Hiltjo,

> I don't like mixing of the existing functions with wchar_t.
> I think st should (at the very least internally) use utf-8.
>
> Won't apply.

I totally agree with you! Come to think of it, do we really need to
compare codepoints here? How about preprocessing worddelimiters and
storing the offset of each codepoint's first byte? Determining whether
a given "lookahead" byte sequence is a delimiter then just means
traversing this sequence, which would be highly cache-efficient.

The only downside I see is adversarial "wasteful" encodings of
codepoints into longer UTF-8 sequences, but if we just want to match
the shortest forms, which occur in 99.999% of cases, we can do a
byte-by-byte comparison, which would also be more efficient.

The question is always how deep we want to go into the Unicode rabbit
hole. I am currently working on a self-generating LUT-based grapheme
cluster "detector" (it basically says whether or not there is a
grapheme cluster break between two codepoints). By preprocessing the
worddelimiters string and identifying the offsets at which each
grapheme cluster begins, you could then simply compare byte sequences.
The downside here is, yet again, ambiguity: there are ways to
"normalize" grapheme clusters, but e.g. the ordering of codepoints is
not always guaranteed.

Anyway, just my 2 cents. The way it is right now works out, though,
and everything regarding the cancerous wide-char standard has been
said.

With best regards

Laslo

-- 
Laslo Hunhold <[email protected]>
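To illustrate the idea, here is a minimal sketch (not st code; the
function names `utf8len` and `isdelim` are made up for this example) of
matching delimiters by raw byte comparison, assuming the delimiter
string only contains shortest-form UTF-8 encodings:

```c
#include <stddef.h>
#include <string.h>

/* Length in bytes of the UTF-8 sequence starting at s, derived from
 * the leading byte alone; invalid leading bytes are treated as a
 * single byte so the scan always makes progress. */
static size_t
utf8len(const char *s)
{
	unsigned char b = (unsigned char)*s;

	if ((b & 0x80) == 0x00) return 1;
	if ((b & 0xE0) == 0xC0) return 2;
	if ((b & 0xF0) == 0xE0) return 3;
	if ((b & 0xF8) == 0xF0) return 4;
	return 1;
}

/* Check whether the sequence at p equals one of the codepoints in the
 * delimiter string delims, by plain byte-by-byte comparison; no
 * decoding to codepoints (or wchar_t) is needed. */
static int
isdelim(const char *p, const char *delims)
{
	size_t n = utf8len(p);
	const char *d;

	for (d = delims; *d; d += utf8len(d))
		if (utf8len(d) == n && memcmp(p, d, n) == 0)
			return 1;
	return 0;
}
```

The inner loop only walks the (short) delimiter string and compares a
handful of adjacent bytes, which is the cache-friendly behaviour
described above; the `utf8len(d)` calls per delimiter are exactly what
a one-time preprocessing pass over worddelimiters could cache as
offsets.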
