Re: [fltk.development] Unicode character display page

Duncan Gibson Tue, 27 Apr 2010 15:45:07 -0700

Me:
>> I've just opened STR 2348: http://www.fltk.org/str.php?L2348
>> "test/editor fails to display misc/cp1252.txt and can hang"
>>
>> Unfortunately, it looks like this problem didn't exist back in r7400
>> before the big refactoring, but was in the last snapshot, r7513, so
>> it looks like something got zapped in the big refactoring.
>>
>> The question now is: do we revert and rework the changes for STR-2158
>> into the "crappier-but-working" code, or do we stick with, and debug,
>> the current "less-cruft-but-failing" code?


Me:
> I've been trying to debug the "less-cruft-but-failing" code, and at
> first I thought the utf-8 aware code was confused by the raw 0x80-0x9f
> control characters, and getting out of sync as it skips to and fro
> trying to find the start/end of the utf-8 sequence.

OK, I think I've found it. And like Deep Thought's answer to the
big question of Life, the Universe and Everything, I don't think
that you are going to like it.

> But then I ran the editor demo in ddd, caused it to lock/hang, and
> then interrupted it, and found it was in a low-level fl_width() call.
> So methinks there's a missing 0-terminator on a string, or a buffer
> overrun somewhere. It might take some time for me to track it down.

What is happening is that the euro character shown in misc/cp1252.png
corresponds to byte value 0x80, which fl_utf8decode() will map to the
correct U+20AC value if some pre-processor constants are defined.
Similar mappings will be made for other characters in the 0x80-0x9f
range that are "defined" in cp1252 encodings, but are C1 control codes
in ISO 8859-1. Text that is completely encoded as UTF-8 will instead
encode the code points U+0080-U+009F as two-byte sequences (0xC2 0x80
to 0xC2 0x9F), so the bytes 0x80-0x9f will never appear standalone.
Because the top bit is set, they can only appear as trailing bytes of
a multibyte sequence. OK, so now I've set the scene.
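
To make the byte patterns concrete, here is a minimal sketch of UTF-8 encoding for code points up to U+FFFF. The name encode_utf8() is hypothetical and this is not FLTK's fl_utf8encode(); it only illustrates that U+20AC and the C1 code points become multibyte sequences.

```cpp
#include <string>

// Hypothetical helper, not part of FLTK: encode one code point as UTF-8.
// Only handles code points up to U+FFFF, which is enough for this example.
std::string encode_utf8(unsigned ucs) {
    std::string out;
    if (ucs < 0x80) {
        out += static_cast<char>(ucs);                         // 0xxxxxxx
    } else if (ucs < 0x800) {
        out += static_cast<char>(0xC0 | (ucs >> 6));           // 110xxxxx
        out += static_cast<char>(0x80 | (ucs & 0x3F));         // 10xxxxxx
    } else {
        out += static_cast<char>(0xE0 | (ucs >> 12));          // 1110xxxx
        out += static_cast<char>(0x80 | ((ucs >> 6) & 0x3F));  // 10xxxxxx
        out += static_cast<char>(0x80 | (ucs & 0x3F));         // 10xxxxxx
    }
    return out;
}
```

With this sketch, the euro sign U+20AC comes out as the three bytes E2 82 AC, and U+0080 as the two bytes C2 80. In both cases the 0x80-0x9f values only ever show up after a header byte.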

In the misc/cp1252.txt example, these 0x80-0x9f bytes appear as
standalone bytes in the text. The Fl_Text_{Buffer,Display,Editor}
code iterates through arrays of bytes. If the top bit is not set,
it is plain old ascii: tabs, C0 control codes 0x01-0x1f and DEL 0x7f
are expanded to spaces or readable text. If the top bit is set, it
must be part of a utf-8 encoded sequence. For a two-byte sequence,
the first byte takes the form 110xxxxx and the second 10xxxxxx. The
code sees the first byte and knows how many bytes there are in the
sequence, and increments the index into the byte array appropriately.
The length of the sequence is returned by fl_utf8len(char c).

In true utf-8 text, fl_utf8len() should only be called on the header
byte, and never on the trailing bytes in the sequence, because they
should be skipped. Calling fl_utf8len() on one of these trailing bytes
is an error, and returns a length of -1.
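
The behaviour described above can be sketched as follows. sketch_utf8len() is a hypothetical stand-in written for illustration, not the FLTK source; in particular, treating 0xF8-0xFF as an error is an assumption of this sketch.

```cpp
// Hypothetical stand-in for the behaviour described in the text.
// Returns the sequence length implied by a header byte, or -1 on error.
int sketch_utf8len(char c) {
    unsigned char b = static_cast<unsigned char>(c);
    if (b < 0x80) return 1;             // 0xxxxxxx: plain ASCII
    if ((b & 0xC0) == 0x80) return -1;  // 10xxxxxx: trailing byte, not a start
    if ((b & 0xE0) == 0xC0) return 2;   // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;   // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;   // 11110xxx
    return -1;                          // assumption: 0xF8-0xFF treated as error
}
```

Note that every byte in the 0x80-0x9f range matches the 10xxxxxx pattern, so a raw cp1252 byte is always reported as an error by a function of this shape.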

In cp1252.txt, the code reads the 0x80-0x9f bytes in isolation:
they are not utf-8 encoded, so there is no header byte. But the code
sees that the top bit is set, so it calls fl_utf8len() on the byte, is
given a length of -1, and so steps the index into the byte array back
by one. This occurs in the Fl_Text_Buffer::expand_character()
function. As a result, the code loops forever, starting again from the
previous byte. The same behaviour could occur elsewhere too. I haven't
checked.
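
The hang can be reproduced in isolation with a small sketch. It assumes the index is advanced by adding the returned length, which is my reading of the description rather than a copy of the FLTK code; len_or_error() and walk_reaches_end() are hypothetical names, and max_steps exists only so that the demonstration terminates.

```cpp
#include <string>

// Hypothetical length function: -1 for a trailing byte (10xxxxxx).
static int len_or_error(unsigned char b) {
    if (b < 0x80) return 1;
    if ((b & 0xC0) == 0x80) return -1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    return 4;
}

// Walk the buffer by repeatedly adding the reported length to the index.
// Returns true if the end of the buffer is reached within max_steps,
// false if the walk gets stuck or steps back before the start.
bool walk_reaches_end(const std::string& buf, int max_steps) {
    long i = 0;
    const long n = static_cast<long>(buf.size());
    for (int step = 0; step < max_steps; ++step) {
        if (i >= n) return true;
        if (i < 0) return false;  // stepped back past the first byte
        i += len_or_error(static_cast<unsigned char>(buf[i]));
    }
    return i >= n;
}
```

For the buffer "x\x80y" the index goes 0, 1, 0, 1, ... and never reaches the end, so walk_reaches_end() returns false however large max_steps is. For "x\xC2\x80y", where the same byte follows a header byte, the walk finishes normally.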

What we can do, and this is what you won't like, is assume that the
index into the byte array always points to a valid ascii character
in the range 0x01-0x7f, or a cp1252 mapping character 0x80-0x9f if
the appropriate #defines are set, or the first byte of a utf-8 sequence.
[this may be a huge code invariant to get right and debug fully!]
We then use fl_utf8decode() and fl_utf8encode() pairs for all bytes
with the top bit set. This would map the 0x80-0x9f bytes to the
equivalent Unicode characters, and expand them to utf-8 sequences.
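
A rough sketch of that decode-then-encode idea follows. It does not call fl_utf8decode() or fl_utf8encode(); the table and the helpers kCp1252, append_utf8(), trailing_count() and normalise_to_utf8() are all hypothetical. Two assumptions are baked in: the undefined cp1252 slots map to themselves, and any other stray byte with the top bit set is read as ISO 8859-1. Overlong or otherwise malformed sequences are not checked.

```cpp
#include <cstddef>
#include <string>

// cp1252 bytes 0x80-0x9f mapped to Unicode; undefined slots map to themselves.
static const unsigned kCp1252[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

// Append one code point (up to U+FFFF) as UTF-8.
static void append_utf8(std::string& out, unsigned ucs) {
    if (ucs < 0x80) {
        out += static_cast<char>(ucs);
    } else if (ucs < 0x800) {
        out += static_cast<char>(0xC0 | (ucs >> 6));
        out += static_cast<char>(0x80 | (ucs & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (ucs >> 12));
        out += static_cast<char>(0x80 | ((ucs >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (ucs & 0x3F));
    }
}

// Trailing bytes promised by a header byte, or 0 if b is not a header byte.
static std::size_t trailing_count(unsigned char b) {
    if ((b & 0xE0) == 0xC0) return 1;
    if ((b & 0xF0) == 0xE0) return 2;
    if ((b & 0xF8) == 0xF0) return 3;
    return 0;
}

// Copy well-formed sequences unchanged; re-encode stray high bytes.
// The index always advances by at least one byte, so the loop terminates.
std::string normalise_to_utf8(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        if (b < 0x80) { out += in[i++]; continue; }
        std::size_t t = trailing_count(b);
        bool ok = t > 0 && i + t < in.size();
        for (std::size_t k = 1; ok && k <= t; ++k)
            ok = (static_cast<unsigned char>(in[i + k]) & 0xC0) == 0x80;
        if (ok) { out.append(in, i, t + 1); i += t + 1; continue; }
        unsigned ucs = (b <= 0x9F) ? kCp1252[b - 0x80] : b;  // stray byte
        append_utf8(out, ucs);
        ++i;
    }
    return out;
}
```

With this sketch a lone 0x80 becomes E2 82 AC (the euro sign), while text that is already utf-8 passes through unchanged.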

Does this sound like I might be on the right track?

For other widgets, i.e. not Fl_Text_{Buffer,Display,Editor}, do we
need to redefine fl_utf8len() so that it returns 1 for utf-8 trailing
bytes, so that we can cope with the cp1252 0x80-0x9f characters?
I suspect that this would also be a major rewrite...
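
As an illustration of what such a redefinition might look like, here is a small sketch. lenient_utf8len() and count_steps() are hypothetical names, and returning 1 for 0xF8-0xFF as well is an assumption of the sketch, not something FLTK does.

```cpp
#include <cstddef>
#include <string>

// Hypothetical variant: a stray trailing byte counts as a one-byte character.
int lenient_utf8len(unsigned char b) {
    if (b < 0x80) return 1;
    if ((b & 0xC0) == 0x80) return 1;  // was -1: step over the byte by itself
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;                          // assumption: other bytes also count as 1
}

// Count how many steps it takes to walk a buffer with the lenient length.
std::size_t count_steps(const std::string& buf) {
    std::size_t i = 0, steps = 0;
    while (i < buf.size()) {
        i += static_cast<std::size_t>(
            lenient_utf8len(static_cast<unsigned char>(buf[i])));
        ++steps;
    }
    return steps;
}
```

Because the length is never less than 1, the walk always moves forward. Callers would still have to decide how to draw the stray byte, which is where the rewrite effort would go.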

Comments?

Cheers
Duncan
