On 10.12.2010, at 20:16, Duncan Gibson wrote:

> The discussion in STR 2348 (http://www.fltk.org/str.php?L2348) with
> the title "test/editor fails to display misc/cp1252.txt and can hang"
> seems to have reached some agreement, for the short term anyway, that
> bytes that have their top bit set but are not part of a valid UTF-8
> sequence should be considered as part of the CP1252 superset of the
> ISO-8859-1 character set and converted to the equivalent UTF-8.
>
> In the short term this will address many users' needs, and will allow
> the developers to make progress towards releasing FLTK-1.3.0.
>
> Under FLTK-1.1 I don't know whether or how people have been able to
> display other, non-UTF-8, character sets, but that avenue may be
> closed to them once the currently envisaged patch for STR 2348 has
> been applied.
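To recap the agreed scheme for those who haven't followed the STR: any byte that cannot be part of a valid UTF-8 sequence is interpreted as CP1252 (a superset of ISO-8859-1) and re-encoded as UTF-8. A simplified sketch of that fallback follows; this is not Manolo's actual patch, and the function name is invented for this mail:

#include <string>

// Unicode code points for CP1252 bytes 0x80..0x9F, the only range that
// differs from ISO-8859-1.
static const unsigned cp1252_high[32] = {
  0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
  0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

// Re-encode one byte (0x80..0xFF) that was not part of a valid UTF-8 sequence.
static std::string fallback_byte_to_utf8(unsigned char b) {
  // bytes 0x80..0x9F use the CP1252 table, everything else is Latin-1
  unsigned ucs = (b >= 0x80 && b <= 0x9F) ? cp1252_high[b - 0x80] : b;
  std::string out;
  if (ucs < 0x800) {                        // 2-byte UTF-8 sequence
    out += (char)(0xC0 | (ucs >> 6));
    out += (char)(0x80 | (ucs & 0x3F));
  } else {                                  // 3-byte UTF-8 sequence
    out += (char)(0xE0 | (ucs >> 12));
    out += (char)(0x80 | ((ucs >> 6) & 0x3F));
    out += (char)(0x80 | (ucs & 0x3F));
  }
  return out;
}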
As far as 8-bit character sets are concerned, there was probably no need to do anything in FLTK 1.1. All text was (and is and will always be) interpreted according to the current codepage (Windows) or locale (Unix/Linux). On Mac OS X the default character set was (and probably still is) MacRoman; I don't know if you can affect it by setting the locale. "Wide" character sets are not supported, and that was the reason why we chose to switch to UTF-8... (But you may all know that.)

> Therefore I propose to raise an STR/RFE against FLTK-1.4 so that we
> can have suggestions on how to support not just the CP1252 superset
> of ISO-8859-1, but also the other ISO-8859-* character sets that
> are still supported, which would cover almost all European languages
> (and some Arabic, Hebrew and Thai if I remember correctly). I assume
> that the MacRoman and other Mac* character sets would also fall under
> the same scheme. I don't know whether we could design the API in
> such a way that we could include UTF-16 and wide character encodings.

[ Details about potential poll elided ]

> Comments?

We cannot support all character sets, not even the other variants of ISO-8859-<whatever>, unless we know the encoding _before_ we start processing the input. Even *if* we knew it, we would need all kinds of character mapping tables for all possible character sets. My opinion is that even the ISO-8859 family alone would be too many tables, and if we start adding mapping tables, this will never end. Currently (with Manolo's recent patch) we have two exceptions:

(1) ISO-8859-1: no mapping tables needed
(2) Windows CP-1252: only 32 characters needed (0x80 - 0x9F)

IMHO there's a simple solution: don't care about the different native character sets at all, unless they have the potential to crash FLTK.

FLTK 1.3 uses UTF-8 internally. All user input will be encoded as UTF-8. We just need to document that all strings and values that are entered into widgets (e.g. by the value(), append(), or similar methods) *MUST* be UTF-8 encoded. Otherwise FLTK's behavior is /undefined/. "Undefined" includes occasional crashes.

Maybe there is one exception to this rule: Fl_Text_Buffer's direct file loading methods, which read a file with a documented API and have the potential to crash the application, because Fl_Text_Buffer/Display/Editor is (or was) not very robust when fed with non-UTF-8 text. Now we have a clear definition: these methods read files encoded in UTF-8, and otherwise "guess" that they are encoded as ISO-8859-1 or CP1252, respectively. IMHO we can't do any better. As you (Duncan) said, we can't even support the widely used MacRoman character set, and our current scheme would also not be able to read UTF-16 encoded files.

So, what to do? My proposal is to add a hook, as proposed by Manolo (STR 2348), for *user-defined* text conversion functions when reading files (Fl_Text_Buffer methods). If needed, the callback function must be set before we read the file, and the user/programmer is responsible for the correctness of the chosen function and its operation (a rough sketch follows below).

Again, I propose to add two exceptions:

(1) UTF-16 (both LE and BE)
(2) MacRoman

UTF-16 should be implemented in the core because it does not need character conversion tables (and may be widely used on Windows), and MacRoman should be included because Mac OS is a long supported platform and we should support its native (8-bit) character set.
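To make the proposed hook a bit more concrete, here is a rough sketch of what such an interface could look like. All names (Fl_Text_Converter, input_converter(), ...) are invented for this mail; this is not existing or agreed FLTK API, just an illustration under the assumption that the converter is applied to every chunk read by loadfile()/insertfile():

// Hypothetical converter type: takes 'len' raw bytes read from the file
// and returns a newly malloc()'d, NUL-terminated UTF-8 string that the
// caller free()s after appending it to the buffer.
typedef char *(Fl_Text_Converter)(const char *in, int len);

class Fl_Text_Buffer_Sketch {      // sketch only, not the real Fl_Text_Buffer
public:
  Fl_Text_Buffer_Sketch() : converter_(0) {}
  // Must be called before loadfile()/insertfile(); passing 0 restores the
  // default behaviour (UTF-8 with the ISO-8859-1/CP1252 fallback).
  void input_converter(Fl_Text_Converter *f) { converter_ = f; }
  Fl_Text_Converter *input_converter() const { return converter_; }
private:
  Fl_Text_Converter *converter_;   // 0 = no conversion, use the default
};

// Usage idea (again, names are hypothetical):
//   buf->input_converter(my_macroman_to_utf8);
//   buf->loadfile("legacy.txt");  // every chunk goes through the converter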
Note that both conversion methods would have to be requested explicitly, before opening and reading a file, by setting the callback (conversion) function.

If this proposal finds consensus, then we're almost there, given that we only have to write one more conversion function (UTF-16, maybe with an additional option to select the endianness instead of making it dependent on the current processor, and/or interpretation of the BOM, the byte order mark). Then we would only need the (128-entry) MacRoman conversion table and the corresponding function.

Albrecht
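P.S. Since I mentioned the BOM: detecting the UTF-16 endianness from it is trivial, something along these lines (a sketch only, with invented names; a file without a BOM would have to use whatever default the caller selected):

// Rough sketch of BOM-based endianness detection for a UTF-16 reader.
enum Utf16Endian { UTF16_LE, UTF16_BE, UTF16_NO_BOM };

static Utf16Endian detect_utf16_bom(const unsigned char *data, int len) {
  if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE) return UTF16_LE;  // FF FE
  if (len >= 2 && data[0] == 0xFE && data[1] == 0xFF) return UTF16_BE;  // FE FF
  return UTF16_NO_BOM;  // no BOM: fall back to the caller's default
}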
