On 04/28/2010 01:55 AM, MacArthur, Ian (SELEX GALILEO, UK) wrote:

> OK - yes, this is a mess. I think the assumption was always that we were
> (somehow) going to make the input text utf8 "clean" when we read it,
> then the majority of the functions and methods would never have to worry
> about this stuff.
>
> So far, that doesn't seem to have worked!
I don't believe trying to make UTF-8 data "clean" is *ever* going to work, and misguided attempts to do so are probably the main reason I18N is about 25 years behind schedule (UTF-8 was invented 25 years ago, believe it or not, and we STILL don't have Unicode filenames).

If you have an array of bytes and some combinations are "illegal", it does not help to pretend that some magical part of the computer hardware will make them not happen. Nor does it help to throw errors, refuse to display anything, and otherwise deny service whenever the bytes happen to be "wrong". The entire idea would be *insane* with any other data structure or communication format (imagine if sending a file were aborted because it contained spelling errors), but for some reason the term "characters" causes otherwise intelligent programmers to turn into idiot savants and go to unbelievable contortions to somehow pretend that the hardware is dividing up the data at irregular boundaries.

Another place to look is at the users of UTF-16. They don't worry about errors and handle them just fine (UTF-16 can, in theory, contain errors in the form of unmatched surrogate halves). The same approach applies to UTF-8.

_______________________________________________
fltk-dev mailing list
[email protected]
http://lists.easysw.com/mailman/listinfo/fltk-dev
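As a minimal sketch of the error-tolerant approach argued for above: a decoder that never refuses input, but emits U+FFFD for each offending byte and resynchronizes on the next one. The function name and the replacement-character policy are my own illustration, not FLTK's actual API (FLTK's own UTF-8 routines differ in detail).

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Decode a byte string as UTF-8 without ever failing: any byte that does
// not begin a valid, complete sequence is emitted as U+FFFD (the Unicode
// replacement character) and decoding resumes at the very next byte.
std::vector<uint32_t> decode_utf8_lenient(const std::string& in) {
    std::vector<uint32_t> out;
    size_t i = 0;
    while (i < in.size()) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        size_t len = 0;
        uint32_t cp = 0;
        if (c < 0x80)      { len = 1; cp = c; }         // ASCII
        else if (c < 0xC2) { len = 0; }                 // stray continuation or overlong lead
        else if (c < 0xE0) { len = 2; cp = c & 0x1F; }
        else if (c < 0xF0) { len = 3; cp = c & 0x0F; }
        else if (c < 0xF5) { len = 4; cp = c & 0x07; }
        else               { len = 0; }                 // 0xF5..0xFF are never valid
        bool ok = (len != 0) && (i + len <= in.size());
        for (size_t j = 1; ok && j < len; ++j) {
            unsigned char cc = static_cast<unsigned char>(in[i + j]);
            if ((cc & 0xC0) != 0x80) ok = false;        // not a continuation byte
            else cp = (cp << 6) | (cc & 0x3F);
        }
        // Reject overlong encodings and UTF-16 surrogate code points,
        // but keep decoding instead of aborting.
        if (ok && len == 3 && (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF))) ok = false;
        if (ok && len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) ok = false;
        if (ok) { out.push_back(cp);     i += len; }
        else    { out.push_back(0xFFFD); i += 1;   }    // consume ONE byte, resync
    }
    return out;
}
```

Advancing by exactly one byte on error is the key design choice: a single bad byte costs one replacement character, and any valid text following it is decoded normally rather than being discarded.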
