On Wed, 13 Mar 2019 20:35:09 +0100
Hiltjo Posthuma <[email protected]> wrote:

Dear Hiltjo,

> I don't like mixing of the existing functions with wchar_t.
> I think st should (at the very least internally) use utf-8.
> 
> Won't apply.

I totally agree with you! Come to think of it, do we really need to
compare codepoints here? How about preprocessing worddelimiters and
storing the offset at which each codepoint begins? Determining whether
a given lookahead byte sequence is a delimiter then just means walking
this preprocessed list and comparing bytes, which would be highly
cache-efficient.

The only downside I see is adversarial "overlong" encodings, which
stretch a codepoint across more UTF-8 bytes than necessary. But if we
just want to match the shortest forms, which occur in 99.999% of the
cases, we can do a byte-by-byte comparison, which would also be more
efficient.

The question is always how deep we want to go into the Unicode
rabbithole. I am currently working on a self-generating LUT-based
grapheme cluster "detector" (it basically says if there is a
grapheme-cluster-break between two codepoints or not). By preprocessing
the worddelimiters string and identifying the offsets at which each
grapheme cluster begins, one could then simply compare byte sequences.

The downside here is, yet again, ambiguity. There are ways to
"normalize" grapheme clusters, but the ordering of codepoints within a
cluster, for instance, is not always guaranteed.

Anyway, just my 2 cents. The way it is right now works out, though, and
everything regarding the cancerous wide-char standard has been said.

With best regards

Laslo

-- 
Laslo Hunhold <[email protected]>
