imacarthur wrote:

> There does seem to be a move (in so far as I think I read an RFC for
> this, regarding usage with html, but can not now find it...) towards
> assuming that any character in a nominally utf-8 text that is found to
> lie in the "awkward" U+0080 to U+009F range is actually a CP-1252
> character and respond accordingly.
Yes, I believe what fltk2 did originally is the right thing. All bytes in a "UTF-8" string that are invalid encodings should be assumed to be CP1252 (i.e. each such byte is translated to the Unicode code point that CP1252 assigns to it). This makes ISO-8859-1 and CP1252 text that is mistakenly identified as UTF-8 display correctly.

There are concerns that this is a security problem; however, it should only cause multiple byte strings to map to the same Unicode text. That is already true on Windows filesystems due to case independence, and on OS X due to its automatic normalization of Unicode (both of these mistakes we can do nothing about). Therefore the mild extra mapping duplication is not, imho, a real security problem.

There is a problem in that the system APIs for drawing UTF-8 strings do not do this. FLTK will have to translate all strings to Unicode indexes first and then call a different API.

> The reasoning being, presumably, that there are a lot of pages out there
> that claim to be utf-8 but were actually written on Windows machines and
> are actually some form of CP125x encoding, so making this simple
> assumption pretty much fixes things. Well, sort of. Or I could just be
> making this up - I am sure I did read it somewhere though!

I think you are also suggesting that a valid UTF-8 string containing encodings of U+0080-U+009F should have those treated as CP1252. I'm not so certain about this. If it is needed, it should only be done by code that prints the string, not by translators from UTF-8 to other encodings.

_______________________________________________
fltk mailing list
[email protected]
http://lists.easysw.com/mailman/listinfo/fltk
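For the record, the fallback being discussed can be sketched roughly as below. This is a minimal illustration, not FLTK's actual implementation; the function name `decode_lenient_utf8` and the table are made up for the example. The idea: attempt a normal UTF-8 decode, and whenever a byte cannot start or complete a valid (non-overlong) sequence, emit the Unicode code point CP1252 assigns to that single byte.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// CP1252 maps bytes 0x80-0x9F to these Unicode code points; all other
// bytes (0x00-0x7F and 0xA0-0xFF) match ISO-8859-1/Unicode directly.
static const uint16_t cp1252_high[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

static uint32_t cp1252_to_unicode(unsigned char b) {
    if (b >= 0x80 && b <= 0x9F) return cp1252_high[b - 0x80];
    return b;
}

// Decode a nominally UTF-8 byte string to Unicode code points,
// treating every invalid byte as CP1252 (illustrative name).
std::vector<uint32_t> decode_lenient_utf8(const std::string& s) {
    std::vector<uint32_t> out;
    size_t i = 0, n = s.size();
    while (i < n) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        int len = 0;
        uint32_t cp = 0;
        if (c < 0x80)                { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        bool ok = (len > 0) && (i + len <= n);
        for (int k = 1; ok && k < len; ++k) {
            unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (cc & 0x3F);
        }
        // Reject overlong encodings so e.g. 0xC0 0x80 also falls back.
        if (ok && ((len == 2 && cp < 0x80) || (len == 3 && cp < 0x800) ||
                   (len == 4 && cp < 0x10000)))
            ok = false;
        if (ok) { out.push_back(cp); i += len; }
        else    { out.push_back(cp1252_to_unicode(c)); ++i; }  // CP1252 fallback
    }
    return out;
}
```

With this, the ISO-8859-1 bytes `caf\xE9` decode to c, a, f, U+00E9, and the CP1252 "smart quote" bytes 0x93/0x94 become U+201C/U+201D, while valid UTF-8 sequences pass through unchanged. This is exactly the "multiple byte strings map to the same Unicode" duplication mentioned above: `\xE9` and the valid UTF-8 `\xC3\xA9` both yield U+00E9.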

