Re: [RP] [PATCH 1/3 v2] Limit width of formatted text by characters rather than bytes

Jeremie Courreges-Anglas Sun, 17 Sep 2017 12:38:59 -0700

On Tue, Aug 29 2017, Will Storey <w...@summercat.com> wrote:
> On Mon 2017-08-28 20:50:19 +0200, Jeremie Courreges-Anglas wrote:
>> 
>> Hi Will,
>
> Hi!
>
> Thank you for looking at this.
>
>> First, thanks for your submission.  You're dealing with a known problem.
>> 
>> The direction taken so far in ratpoison was: don't deal with wide
>> characters, only handle UTF-8 in a rather dumb but at least simple way.
>> 
>> Rationale:
>> - the wide characters API has a lot of gotchas.  I won't detail them
>>   here but what to do in case of an invalid sequence often remains an
>>   open question.  Here, I can see that you return a partial length
>>   early.  I'm not sure this is desirable.
>
> I see. I'm not super familiar with the wchar.h API. I was not aware
> ratpoison had functionality for this!
>
> Regarding returning on invalid characters: Another option we could do would
> be to replace them with U+FFFD.
>
>> - UTF-8 is easy and looks like the sanest choice for a multibyte locale.
>>   No offense, but other less commonly used locales are just a pain to
>>   handle.  Think state-dependant encodings.
>
> Well, even with UTF-8 it is not so easy to do everything perfectly! I'm not
> sure how wchar.h deals with there being combining characters in weird spots
> for instance. That might be something to look at if we ever revisited using
> it.
>
>> So while technically speaking the wide characters API looks like the
>> obvious choice, I think its cost is a bit high.  Consistency is good.
>> If we start using the wide chars API somewhere, it should be used in all
>> places where it makes sense.  I'm not sure this is an easy task even in
>> ratpoison. :)
>>
>> Handling only UTF-8 as a multibyte locale, the tentative diff below
>> seems to do the job.  *WARNING*: I have barely tested it with your html
>> testcase.
>> 
>> Feedback / test reports welcome.
>
> Cool! Thanks for writing that. I've tried it out and it works well.


Actually the behavior was rather incorrect.  After my patch,
concat_width was copying up to 'width' bytes, not up to 'width' UTF-8
characters.  This gotcha was caught by your test case.  I find the
latter behavior more useful.

Note that the function only cares about how many characters we concat,
not about their actual width on screen.  I believe this behavior is
reasonable, though.

> I'm in agreement about only worrying about support for UTF-8.
>
> After looking at the UTF8 macros, one thought I have is we could improve
> this to be more conservative about what we accept. For example, only
> consuming two, three, or four bytes when the first byte indicates that is
> appropriate, rather than having no limit. I suppose it depends how far we
> want to go in writing UTF-8 decoding.

So far the direction taken was not to bother validating anything.
Garbage in, garbage out.

> Anyway, I think it is a big improvement as is.
>
> It also still might be good to have a few unit tests. Would you be okay
> with tests in the form of my third patch?

Yep.  I have applied your 3/3 patch, which introduces a testcase, and
made further tweaks.  Namely:
- rename concat_width to something more appropriate
- move it so sbuf.c
- shuffled some code to make it easier to link against it
- move the testcase to its own .c file
- simplify src/Makefile.am accordingly

Feedback welcome.

Thanks,
-- 
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF  DDCC 0DFA 74AE 1524 E7EE

signature.asc
Description: PGP signature

_______________________________________________
Ratpoison-devel mailing list
Ratpoison-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/ratpoison-devel

Re: [RP] [PATCH 1/3 v2] Limit width of formatted text by characters rather than bytes

Reply via email to