Re: [fltk.development] RFC: Pure UTF-8 or Hybrid CP1252 ?

Duncan Gibson Tue, 23 Nov 2010 15:30:06 -0800

Albrecht:
>> I'm thinking of test/editor: can/should it check the text buffer after
>> reading a file? Sometimes you don't know a file's encoding before
>> you open it, and what about the file chooser's preview? Could it
>> crash a user's program if s/he looks at a file in CP-1252 encoding?


Matt:
> We can not check and guess the format of all files. But we must ensure
> that loading and saving a text file without editing it should always
> create exactly the same file again (no hidden conversions), and that
> none of the widgets crashes if there is an unsupported UTF8 sequence.


Maybe I'm not really the right person to comment on this: I'm a
native English speaker and I only work on Linux, so I only need
ASCII, not accented Latin characters, never mind Greek or
Cyrillic or Chinese. But having said that...

According to Poll #22, more than a third of FLTK users are from
Western Europe and are therefore likely to be using accented Latin
characters; 12% are from Eastern Europe, which probably means that
they use Greek or Cyrillic characters, and 15% are from Asia, with
other character sets to support. To reach those people it is clear
that FLTK needs to offer more than just ASCII, ISO-8859-1 or CP1252.

From an implementation point of view it's clear that pure UTF-8
is the easiest and most flexible way to go to enable access to a
wider range of character sets. But from a pragmatic point of view
many users must already be working with ISO-8859-1 or CP1252, and
we probably need to continue to support them. The good news is
that those two character sets are already well defined:

    http://en.wikipedia.org/wiki/ISO/IEC_8859-1
    http://en.wikipedia.org/wiki/Windows-1252

And what is more, there are already hooks in the current FLTK-1.3
code base to recognize and convert those characters. Having seen
the additional complexity and string copying required to support a
hybrid UTF-8 and CP1252 system in http://www.fltk.org/str.php?L2348
I would suggest that we don't support the hybrid system as such.
Instead we enforce UTF-8 internally in the core FLTK widgets, and
ensure that we provide conversion during input or loading from file.

To answer Matt's point above, the Fl_Text_Buffer::insertfile()
method could pass through any characters consisting of plain ASCII
or valid UTF-8 sequences without change. Any ISO-8859-1 or CP1252
characters could be expanded using the existing hooks in the code[*]
into corresponding UTF-8 sequences which could then be added to the
buffer. In this latter case, the buffer would be marked as "dirty"
so that the caller, such as the Fl_Text_Editor class, could issue
a warning that the text had been converted. The caller could then
mark the buffer as "clean" so that it would not need to be saved
unless additional changes were made by the caller. Or there could
be two flags: one for "converted on input" and one for "dirty".

[*] The code is currently set up to use macros that enable the
    conversion of isolated 0x80-0xff bytes outside of UTF-8
    sequences into ISO-8859-1 or CP1252 equivalent Unicode values,
    or convert them into the Unicode REPLACEMENT CHARACTER.
    But you have to edit the source to change these macros.
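
The load-time conversion described above can be sketched roughly as
follows. This is only an illustration, not FLTK code: load_as_utf8(),
utf8_seq_len() and LoadResult are hypothetical names, the input is a
std::string rather than an Fl_Text_Buffer, and stray bytes are treated
as ISO-8859-1 only (CP1252 would need a lookup table for 0x80-0x9F).
ASCII and valid UTF-8 pass through unchanged; any other byte is
expanded to a two-byte UTF-8 sequence and the "converted" flag is set.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not an FLTK function): returns the length of the
// valid UTF-8 sequence starting at s[i], or 0 if the bytes there are not
// valid UTF-8 (stray byte, truncated, overlong, surrogate, out of range).
static std::size_t utf8_seq_len(const std::string &s, std::size_t i) {
  assert(i < s.size());
  unsigned char c = static_cast<unsigned char>(s[i]);
  if (c < 0x80) return 1;                      // plain ASCII
  std::size_t n;
  unsigned cp;
  if (c >= 0xC2 && c <= 0xDF)      { n = 2; cp = c & 0x1F; }
  else if ((c & 0xF0) == 0xE0)     { n = 3; cp = c & 0x0F; }
  else if (c >= 0xF0 && c <= 0xF4) { n = 4; cp = c & 0x07; }
  else return 0;                               // not a valid lead byte
  if (i + n > s.size()) return 0;              // truncated sequence
  for (std::size_t k = 1; k < n; ++k) {
    unsigned char cc = static_cast<unsigned char>(s[i + k]);
    if ((cc & 0xC0) != 0x80) return 0;         // not a continuation byte
    cp = (cp << 6) | (cc & 0x3F);
  }
  if (n == 3 && (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF))) return 0;
  if (n == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return 0;
  return n;
}

// Hypothetical result type: the text plus the "converted on input" flag.
struct LoadResult {
  std::string text;   // always valid UTF-8
  bool converted;     // true if any stray byte had to be re-encoded
};

// Assumption: stray 0x80-0xFF bytes are ISO-8859-1, so byte b maps to
// code point U+00b, which always encodes as exactly two UTF-8 bytes.
static LoadResult load_as_utf8(const std::string &raw) {
  LoadResult r;
  r.converted = false;
  std::size_t i = 0;
  while (i < raw.size()) {
    std::size_t n = utf8_seq_len(raw, i);
    if (n > 0) {                               // pass through unchanged
      r.text.append(raw, i, n);
      i += n;
    } else {                                   // expand the stray byte
      unsigned char b = static_cast<unsigned char>(raw[i]);
      r.text += static_cast<char>(0xC0 | (b >> 6));
      r.text += static_cast<char>(0x80 | (b & 0x3F));
      r.converted = true;
      ++i;
    }
  }
  return r;
}
```

A caller such as an editor could check the converted flag after
loading and warn the user. A file that was already ASCII or valid
UTF-8 comes back byte-for-byte identical with the flag unset, which
is the "no hidden conversions" property Matt asked for.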

The current fl_utf8decode() documentation contains the example:

    If you want errors to be converted to error characters (as the
    standards recommend), adding a test to see if the length is
    unexpectedly 1 will work:

    if (*p & 0x80) { // what should be a multibyte encoding
      code = fl_utf8decode(p,end,&len);
      if (len<2) code = 0xFFFD; // Turn errors into REPLACEMENT CHARACTER
    } else { // handle the 1-byte utf8 encoding:
      code = *p;
      len = 1;
    }

Would it not be possible to embed the fl_utf8decode() call in a method
of a "character input" class? The base class would handle ASCII and
valid UTF-8 sequences. One derived class could handle ISO-8859-1
conversion, and another could be extended further to handle CP1252.
Users would be free to provide their own derived classes as needed.
The Fl class could have a "character input" class pointer member
so that the user could customise character conversion at the start
of execution.
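
To make the class idea a bit more concrete, here is a rough sketch.
None of these classes exist in FLTK: CharInput, Latin1Input,
Cp1252Input and stray_byte() are names invented for illustration, and
the base class carries its own small UTF-8 decoder instead of calling
fl_utf8decode() so that the example stands alone. The base class
turns stray bytes into U+FFFD, and each derived class overrides only
the handling of stray bytes. The CP1252 table maps the five bytes
that CP1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) to the
same code point value, which is one common convention, not the only
possible choice.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical base class: decodes ASCII and valid UTF-8, and maps
// anything else to U+FFFD REPLACEMENT CHARACTER.
class CharInput {
public:
  virtual ~CharInput() {}
  // Decode one character at p (requires p < end). Returns the code
  // point and stores the number of bytes consumed in len.
  unsigned decode(const unsigned char *p, const unsigned char *end,
                  int &len) const {
    assert(p < end);
    unsigned char c = *p;
    len = 1;
    if (c < 0x80) return c;                    // plain ASCII
    int n = (c >= 0xC2 && c <= 0xDF) ? 2
          : ((c & 0xF0) == 0xE0)     ? 3
          : (c >= 0xF0 && c <= 0xF4) ? 4 : 0;
    if (n == 0 || end - p < n) return stray_byte(c);
    unsigned cp = c & (0xFF >> (n + 1));       // payload bits of lead byte
    for (int k = 1; k < n; ++k) {
      if ((p[k] & 0xC0) != 0x80) return stray_byte(c);
      cp = (cp << 6) | (p[k] & 0x3F);
    }
    bool bad = (n == 3 && (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF)))
            || (n == 4 && (cp < 0x10000 || cp > 0x10FFFF));
    if (bad) return stray_byte(c);
    len = n;
    return cp;
  }
protected:
  // Called for a byte >= 0x80 that does not start a valid sequence.
  virtual unsigned stray_byte(unsigned char) const { return 0xFFFD; }
};

// Stray bytes are ISO-8859-1: the byte value is the code point.
class Latin1Input : public CharInput {
protected:
  unsigned stray_byte(unsigned char b) const override { return b; }
};

// Stray bytes are CP1252: only 0x80-0x9F differ from ISO-8859-1.
class Cp1252Input : public Latin1Input {
protected:
  unsigned stray_byte(unsigned char b) const override {
    static const unsigned short table[32] = {
      0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
      0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
      0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
      0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
    };
    if (b >= 0x80 && b <= 0x9F) return table[b - 0x80];
    return Latin1Input::stray_byte(b);
  }
};
```

With something like this, a lone 0x80 byte decodes to U+FFFD with the
base class, to U+0080 with Latin1Input, and to U+20AC (the euro sign)
with Cp1252Input, while a valid sequence such as C3 A9 decodes to
U+00E9 in all three. An application would then pick one of these
objects once at startup, for example through a pointer held by Fl.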

This last part is just an idea for the future, not a suggestion for
the FLTK-1.3.0 release.

Sorry for the long post.

Cheers
Duncan