On 15/04/13 14:24, Martin Schreiber wrote:
>
> Eh, what about decomposed characters and 1,2,3 or 4 byte code points?

Handling 1-4 byte code points is a standard part of any UTF-8 implementation. So a standard UTF-8 implementation can handle Unicode Planes 0-16 without any problems.

Now when it comes to UTF-16, many developers are lazy and think that the BMP (Plane 0) is more than enough and covers all spoken languages - so they implement UTF-16 as if every code point fits in a single 2-byte code unit. That is just WRONG, because now Planes 1-16 (which need 4-byte surrogate pairs) are not supported. UCS-2 (what MSEgui supports) is just Unicode Plane 0.

Decomposed characters are a totally different subject, and affect UTF-8, UTF-16 and UTF-32 alike. This topic is what probably confuses most people. They think that because you see one "character" on the screen, it must be one code point. That assumption is simply not true. Different OSes handle this differently. I believe (not 100% sure) that Mac OS X always decomposes "characters" at the file system level. I believe Linux is the opposite. No idea what Windows does.

>> and no worries about Little Endian or
>> Big Endian (which is also to be considered when using UTF-16).
>>
> In process space, really? Please explain.

If one makes the claim that a framework fully supports UTF-16, then that framework must also be able to handle UTF-16 encoded files (even though they are rare). In that case, such a framework must also take into account the type of UTF-16 encoding, which could be LE or BE. With UTF-8, there is just one type of encoding, and you don't need to worry about endianness.

> MSEgui does not use utf-16 on disk
> or over the wire.

Which brings up another advantage of UTF-8. UTF-8 is probably the most common on-disk storage [Unicode] encoding. UTF-8 encoded data is already in a byte format, so it is ideal for on-disk storage or streaming (over the wire). A UTF-16 framework must convert to and from UTF-8 for on-disk storage.

Anyway, I didn't want to get into this whole UTF-8 vs UTF-16 debate. There never are winners.
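
The encoding points above (1-4 byte code points, surrogate pairs outside the BMP, decomposed characters, and byte order) can be checked with a minimal Python sketch. It is only an illustration: the sample characters and the helper names `utf8_len` and `utf16_units` are hypothetical choices made for this example, and nothing here is MSEgui code.

```python
import unicodedata


def utf8_len(ch: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def utf16_units(ch: str) -> int:
    """Number of 16-bit code units the text occupies in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


# Arbitrary sample code points: U+0041, U+00E9, U+20AC (all in the BMP)
# and U+1D11E (Plane 1, outside the BMP).
samples = ["\u0041", "\u00e9", "\u20ac", "\U0001D11E"]

utf8_sizes = [utf8_len(c) for c in samples]      # 1, 2, 3 and 4 bytes
utf16_sizes = [utf16_units(c) for c in samples]  # last one needs a surrogate pair

# Decomposition: the same visible "character" as one or two code points.
composed = unicodedata.normalize("NFC", "e\u0301")   # U+00E9
decomposed = unicodedata.normalize("NFD", "\u00e9")  # 'e' + U+0301

# Byte order only matters for UTF-16: same text, different byte sequences.
le = "A".encode("utf-16-le")
be = "A".encode("utf-16-be")

print(utf8_sizes, utf16_sizes)
print(len(composed), len(decomposed))
print(le, be)
```

A UCS-2-only implementation would treat the two code units of the Plane 1 sample as two separate "characters", which is exactly the gap described above. The composed and decomposed strings compare as unequal unless they are normalized first, regardless of which encoding is used.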

But blatantly stating that UTF-8 is multitudes slower than UTF-16 is just so wrong. And that is the statement I wanted to correct with my reply.

Regards,
  - Graeme -

_______________________________________________
mseide-msegui-talk mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/mseide-msegui-talk

