imacarthur wrote:

> There does seem to be a move (insofar as I think I read an RFC for 
> this, regarding usage with HTML, but cannot now find it...) towards 
> assuming that any character in a nominally UTF-8 text that is found to 
> lie in the "awkward" U+0080 to U+009F range is actually a CP-1252 
> character and responding accordingly.

Yes, I believe what fltk2 did originally is the right thing. All bytes in 
a "UTF-8" string that are invalid encodings should be assumed to be 
CP1252 (i.e., each byte is translated to the Unicode code point that 
CP1252 assigns it). This will make ISO-8859-1 and CP1252 text that is 
mistakenly identified as UTF-8 display correctly.
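A minimal sketch of that fallback decode (the names `decode_utf8_cp1252` and `cp1252_byte` are hypothetical, not fltk2's actual code): valid UTF-8 sequences decode normally, and any byte that cannot be part of one is emitted as the Unicode code point its CP1252 interpretation denotes, yielding code-point indexes a drawing API could consume.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Unicode code points for CP1252 bytes 0x80..0x9F; the five bytes CP1252
// leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) pass through unchanged.
static const uint16_t kCp1252Hi[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178};

static uint32_t cp1252_byte(uint8_t b) {
  return (b >= 0x80 && b <= 0x9F) ? kCp1252Hi[b - 0x80] : b;
}

// Decode s as UTF-8, but treat any byte that is not part of a valid UTF-8
// sequence as a CP1252 character instead of an error. ISO-8859-1 or
// CP1252 text mislabeled as UTF-8 then comes out as the intended text.
std::vector<uint32_t> decode_utf8_cp1252(const std::string &s) {
  std::vector<uint32_t> out;
  size_t i = 0, n = s.size();
  while (i < n) {
    uint8_t b0 = static_cast<uint8_t>(s[i]);
    size_t len = 0;
    uint32_t cp = 0;
    if (b0 < 0x80)                { len = 1; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    bool ok = (len != 0) && (i + len <= n);
    for (size_t k = 1; ok && k < len; ++k) {
      uint8_t b = static_cast<uint8_t>(s[i + k]);
      if ((b & 0xC0) != 0x80) ok = false;            // bad continuation byte
      else cp = (cp << 6) | (b & 0x3F);
    }
    // Overlong encodings and surrogates are invalid UTF-8 as well.
    if (ok && len == 2 && cp < 0x80) ok = false;
    if (ok && len == 3 && (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF))) ok = false;
    if (ok && len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) ok = false;
    if (ok) { out.push_back(cp); i += len; }
    else    { out.push_back(cp1252_byte(b0)); i += 1; }  // CP1252 fallback
  }
  return out;
}
```

Note that only the lead byte of a failed sequence is consumed as CP1252, so decoding resynchronizes on the very next byte.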

There are concerns that this is a security problem; however, it should 
only cause multiple UTF-8 strings to map to the same Unicode. This is 
already true on Windows filesystems due to case independence, and on 
OS X due to its automatic normalization of Unicode (we can do nothing 
about either of these mistakes). Therefore the mild extra mapping 
duplication is not, imho, a real security problem.

There is a problem: the system APIs that draw UTF-8 strings do not do 
this. FLTK will have to translate all strings to Unicode code points 
first and then call a different API.

> The reasoning being, presumably, that there are a lot of pages out there 
> that claim to be UTF-8 but were actually written on Windows machines and 
> are actually some form of CP125x encoding, so making this simple 
> assumption pretty much fixes things. Well, sort of. Or I could just be 
> making this up - I am sure I did read it somewhere though!

I think you are also suggesting that valid UTF-8 encodings of 
U+0080-U+009F in a string should be treated as CP1252. I'm not so 
certain about this. If it is needed, it should only be done by code that 
prints the string, not by translators from UTF-8 to other encodings.
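If it were done at print time only, the remap would touch nothing but decoded code points in the C1 control range U+0080..U+009F, which have no visible glyphs anyway. A hedged sketch (the helper name `remap_c1_as_cp1252` is hypothetical):

```cpp
#include <cstdint>

// Print-time-only remap: a decoded code point in the C1 control range
// U+0080..U+009F is replaced by the character the CP1252 byte of the
// same value denotes; everything else passes through untouched.
// (The five CP1252 holes 0x81, 0x8D, 0x8F, 0x90, 0x9D stay as-is.)
uint32_t remap_c1_as_cp1252(uint32_t cp) {
  static const uint16_t hi[32] = {
      0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
      0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
      0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
      0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178};
  return (cp >= 0x80 && cp <= 0x9F) ? hi[cp - 0x80] : cp;
}
```

Keeping this out of the UTF-8-to-other-encoding translators, as suggested above, means conversions stay lossless and only the on-screen rendering changes.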
_______________________________________________
fltk mailing list
[email protected]
http://lists.easysw.com/mailman/listinfo/fltk
