On Tue, Aug 29 2017, Will Storey <w...@summercat.com> wrote: > On Mon 2017-08-28 20:50:19 +0200, Jeremie Courreges-Anglas wrote: >> >> Hi Will, > > Hi! > > Thank you for looking at this. > >> First, thanks for your submission. You're dealing with a known problem. >> >> The direction taken so far in ratpoison was: don't deal with wide >> characters, only handle UTF-8 in a rather dumb but at least simple way. >> >> Rationale: >> - the wide characters API has a lot of gotchas. I won't detail them >> here but what to do in case of an invalid sequence often remains an >> open question. Here, I can see that you return a partial length >> early. I'm not sure this is desirable. > > I see. I'm not super familiar with the wchar.h API. I was not aware > ratpoison had functionality for this! > > Regarding returning on invalid characters: Another option we could do would > be to replace them with U+FFFD. > >> - UTF-8 is easy and looks like the sanest choice for a multibyte locale. >> No offense, but other less commonly used locales are just a pain to >> handle. Think state-dependant encodings. > > Well, even with UTF-8 it is not so easy to do everything perfectly! I'm not > sure how wchar.h deals with there being combining characters in weird spots > for instance. That might be something to look at if we ever revisited using > it. > >> So while technically speaking the wide characters API looks like the >> obvious choice, I think its cost is a bit high. Consistency is good. >> If we start using the wide chars API somewhere, it should be used in all >> places where it makes sense. I'm not sure this is an easy task even in >> ratpoison. :) >> >> Handling only UTF-8 as a multibyte locale, the tentative diff below >> seems to do the job. *WARNING*: I have barely tested it with your html >> testcase. >> >> Feedback / test reports welcome. > > Cool! Thanks for writing that. I've tried it out and it works well.
Actually the behavior was rather incorrect. After my patch, concat_width was copying up to 'width' bytes, not up to 'width' UTF-8 characters. This gotcha was caught by your test case. I find the latter behavior more useful. Note that the function only cares about how many characters we concat, not about their actual width on screen. I believe this behavior is reasonable, though. > I'm in agreement about only worrying about support for UTF-8. > > After looking at the UTF8 macros, one thought I have is we could improve > this to be more conservative about what we accept. For example, only > consuming two, three, or four bytes when the first byte indicates that is > appropriate, rather than having no limit. I suppose it depends how far we > want to go in writing UTF-8 decoding. So far the direction taken was not to bother validating anything. Garbage in, garbage out. > Anyway, I think it is a big improvement as is. > > It also still might be good to have a few unit tests. Would you be okay > with tests in the form of my third patch? Yep. I have applied your 3/3 patch, which introduces a testcase, and made further tweaks. Namely: - rename concat_width to something more appropriate - move it so sbuf.c - shuffled some code to make it easier to link against it - move the testcase to its own .c file - simplify src/Makefile.am accordingly Feedback welcome. Thanks, -- jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF DDCC 0DFA 74AE 1524 E7EE
signature.asc
Description: PGP signature
_______________________________________________ Ratpoison-devel mailing list Ratpoison-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/ratpoison-devel