On Thu, Mar 14, 2019 at 09:57:02AM +0200, Lauri Tirkkonen wrote:
> Hi,
>
> On Wed, Mar 13 2019 20:35:09 +0100, Hiltjo Posthuma wrote:
> > I don't like mixing of the existing functions with wchar_t.
> > I think st should (at the very least internally) use utf-8.
>
> I think I explained my position poorly, so let me try to clarify.
> My apologies if this seems a bit pushy :)
>
> First - I agree with using UTF-8. That's actually how I ended up with
> this diff -- I was trying to configure U+3000 IDEOGRAPHIC SPACE as a
> delimiter, but seeing that worddelimiters was char *, I started
> wondering whether I could actually use unicode characters in it and had
> to go read the code, thus finding utf8strchr().
>
> utf8strchr() is a bit peculiar - on every call to ISDELIM(), it decodes
> the worddelimiters utf-8 string into Runes (so that it can compare to
> the Rune argument). It seems a little strange to me to be doing that --
> the delimiters string cannot change at runtime, so storing the
> codepoints instead of the multibyte string feels like a better fit. And
> that's what wchar_t * is, with the added bonus that we can use libc
> wcschr() instead of rolling our own search function.
>
> I already mentioned that Rune is being passed to wcwidth(wchar_t), so it
> seems like there is a builtin assumption that Rune and wchar_t hold
> equivalent values. I actually don't understand why that typedef exists
> instead of just using wchar_t; maybe I'm missing something.
>
> Could you explain what it is that you don't like about wchar_t?
>
Hi,

I've applied both of the patches and a small change to the default
worddelimiters.

Thanks for the clarifications. The codepoint assumption was indeed wrong.

I do not mind wchar_t, but in practice it is not consistent across
platforms. However, we already use wchar_t in st, so its use should be as
correct as possible and match the POSIX standard.

(@Laslo) for simplicity's and sanity's sake I think assuming 1 codepoint
is 1 "character" makes sense.

Thanks,

--
Kind regards,
Hiltjo
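For readers following the thread, here is a minimal sketch of the lookup
discussed above: the delimiters are stored as code points in a wchar_t
string and searched with libc wcschr(). It is not the applied patch. The
delimiter set and the isdelim() helper are illustrative assumptions, and
the sketch assumes a platform where wchar_t holds Unicode code points,
which is exactly the part that is not consistent everywhere.

```c
#include <stddef.h>
#include <wchar.h>

/*
 * Illustrative delimiter set (an assumption, not st's default):
 * ASCII space plus U+3000 IDEOGRAPHIC SPACE from the thread.
 * Assumes wchar_t stores Unicode code points on this platform.
 */
static const wchar_t *worddelimiters = L" \u3000";

/*
 * Hypothetical helper: returns 1 if u is listed in worddelimiters,
 * 0 otherwise. NUL is excluded explicitly, because wcschr() would
 * otherwise match the string's terminator.
 */
static int
isdelim(wchar_t u)
{
	return u != L'\0' && wcschr(worddelimiters, u) != NULL;
}
```

With this set, isdelim(L' ') and isdelim(0x3000) return 1, while
isdelim(L'a') and isdelim(L'\0') return 0. Nothing is decoded per call,
since the delimiters are already code points.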
