On 10.12.2010, at 20:16, Duncan Gibson wrote:

> The discussion in STR 2348 (http://www.fltk.org/str.php?L2348) with
> the title "test/editor fails to display misc/cp1252.txt and can hang"
> seems to have reached some agreement, for the short term anyway, that
> bytes that have their top bit set but are not part of a valid UTF-8
> sequence should be considered as part of the CP1252 superset of the
> ISO-8859-1 character set and converted to the equivalent UTF-8.
>
> In the short term this will address many users' needs, and will allow
> the developers to make progress towards releasing FLTK-1.3.0.
>
> Under FLTK-1.1 I don't know whether or how people have been able to
> display other, non-UTF-8, character sets, but that avenue may be
> closed to them once the currently envisaged patch for STR 2348 has
> been applied.
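To recap the agreed scheme for those who haven't followed the STR: any byte that cannot be part of a valid UTF-8 sequence is interpreted as CP1252 (a superset of ISO-8859-1) and re-encoded as UTF-8. A simplified sketch of that fallback follows; this is not Manolo's actual patch, and the function name is invented for this mail:

#include <string>

// Unicode code points for CP1252 bytes 0x80..0x9F, the only range that
// differs from ISO-8859-1.
static const unsigned cp1252_high[32] = {
  0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
  0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

// Re-encode one byte (0x80..0xFF) that was not part of a valid UTF-8 sequence.
static std::string fallback_byte_to_utf8(unsigned char b) {
  // bytes 0x80..0x9F use the CP1252 table, everything else is Latin-1
  unsigned ucs = (b >= 0x80 && b <= 0x9F) ? cp1252_high[b - 0x80] : b;
  std::string out;
  if (ucs < 0x800) {                        // 2-byte UTF-8 sequence
    out += (char)(0xC0 | (ucs >> 6));
    out += (char)(0x80 | (ucs & 0x3F));
  } else {                                  // 3-byte UTF-8 sequence
    out += (char)(0xE0 | (ucs >> 12));
    out += (char)(0x80 | ((ucs >> 6) & 0x3F));
    out += (char)(0x80 | (ucs & 0x3F));
  }
  return out;
}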
As far as 8-bit character sets are concerned, there was probably no need to do anything in FLTK 1.1. All text was (and is and will always be) interpreted according to the current codepage (Windows) or locale (Unix/Linux). On Mac OS X the default character set was (and probably still is) MacRoman; I don't know if you can affect it by setting the locale. "Wide" character sets are not supported, and that was the reason why we chose to switch to UTF-8... (But you may all know that.)

> Therefore I propose to raise an STR/RFE against FLTK-1.4 so that we
> can have suggestions on how to support not just the CP1252 superset
> of ISO-8859-1, but also the other ISO-8859-* character sets that
> are still supported, which would cover almost all European languages
> (and some Arabic, Hebrew and Thai if I remember correctly). I assume
> that the MacRoman and other Mac* character sets would also fall under
> the same scheme. I don't know whether we could design the API in
> such a way that we could include UTF-16 and wide character encodings.

[ Details about potential poll elided ]

> Comments?

We cannot support all character sets, not even the other variants of ISO-8859-<whatever>, unless we know the encoding _before_ we start processing the input. Even *if* we knew it, we would need all kinds of character mapping tables for all possible character sets. My opinion is that even the ISO-8859 family alone would be too many tables, and if we start adding mapping tables, this will never end. Currently (with Manolo's recent patch) we have two exceptions:

(1) ISO-8859-1: no mapping tables needed
(2) Windows CP-1252: only 32 characters needed (0x80 - 0x9F)

IMHO there's a simple solution: don't care about the different native character sets at all, unless they have the potential to crash FLTK.

FLTK 1.3 uses UTF-8 internally. All user input will be encoded as UTF-8. We just need to document that all strings and values that are entered into widgets (e.g. by the value(), append(), or similar methods) *MUST* be UTF-8 encoded. Otherwise FLTK's behavior is /undefined/. "Undefined" includes occasional crashes.

Maybe there is one exception to this rule: Fl_Text_Buffer's direct file loading methods, which read a file with a documented API and have the potential to crash the application, because Fl_Text_Buffer/Display/Editor is (or was) not very robust when fed with non-UTF-8 text. Now we have a clear definition: these methods read files encoded in UTF-8, and otherwise "guess" that they are encoded as ISO-8859-1 or CP1252, respectively. IMHO we can't do any better. As you (Duncan) said, we can't even support the widely used MacRoman character set, and our current scheme would also not be able to read UTF-16 encoded files.

So, what to do? My proposal is to add a hook, as proposed by Manolo (STR 2348), for *user-defined* text conversion functions when reading files (Fl_Text_Buffer methods). If needed, the callback function must be set before we read the file, and the user/programmer is responsible for the correctness of the chosen function and its operation (a rough sketch follows below).

Again, I propose to add two exceptions:

(1) UTF-16 (both LE and BE)
(2) MacRoman

UTF-16 should be implemented in the core because it does not need character conversion tables (and may be widely used on Windows), and MacRoman should be included because Mac OS is a long supported platform and we should support its native (8-bit) character set.
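To make the proposed hook a bit more concrete, here is a rough sketch of what such an interface could look like. All names (Fl_Text_Converter, input_converter(), ...) are invented for this mail; this is not existing or agreed FLTK API, just an illustration under the assumption that the converter is applied to every chunk read by loadfile()/insertfile():

// Hypothetical converter type: takes 'len' raw bytes read from the file
// and returns a newly malloc()'d, NUL-terminated UTF-8 string that the
// caller free()s after appending it to the buffer.
typedef char *(Fl_Text_Converter)(const char *in, int len);

class Fl_Text_Buffer_Sketch {      // sketch only, not the real Fl_Text_Buffer
public:
  Fl_Text_Buffer_Sketch() : converter_(0) {}
  // Must be called before loadfile()/insertfile(); passing 0 restores the
  // default behaviour (UTF-8 with the ISO-8859-1/CP1252 fallback).
  void input_converter(Fl_Text_Converter *f) { converter_ = f; }
  Fl_Text_Converter *input_converter() const { return converter_; }
private:
  Fl_Text_Converter *converter_;   // 0 = no conversion, use the default
};

// Usage idea (again, names are hypothetical):
//   buf->input_converter(my_macroman_to_utf8);
//   buf->loadfile("legacy.txt");  // every chunk goes through the converter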
Note that both conversion methods would have to be requested explicitly, before opening and reading a file, by setting the callback (conversion) function.

If this proposal finds consensus, then we're almost there, given that we only have to write one more conversion function (UTF-16, maybe with an additional option to select the endianness instead of making it dependent on the current processor, and/or interpretation of the BOM, the byte order mark). Then we would only need the (128-entry) MacRoman conversion table and the corresponding function.

Albrecht
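P.S. Since I mentioned the BOM: detecting the UTF-16 endianness from it is trivial, something along these lines (a sketch only, with invented names; a file without a BOM would have to use whatever default the caller selected):

// Rough sketch of BOM-based endianness detection for a UTF-16 reader.
enum Utf16Endian { UTF16_LE, UTF16_BE, UTF16_NO_BOM };

static Utf16Endian detect_utf16_bom(const unsigned char *data, int len) {
  if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE) return UTF16_LE;  // FF FE
  if (len >= 2 && data[0] == 0xFE && data[1] == 0xFF) return UTF16_BE;  // FE FF
  return UTF16_NO_BOM;  // no BOM: fall back to the caller's default
}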
