On 15/04/13 14:24, Martin Schreiber wrote:
> 
> Eh, what about decomposed characters and 1,2,3 or 4 byte code points?

UTF-8 implementations handle 1-4 byte code points as standard. It is
part of any UTF-8 implementation. So a standard UTF-8 implementation can
handle Unicode Planes 0-16 without any problems.
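
To make the 1-4 byte point concrete, here is a small Python sketch (my own illustration, not code from any framework discussed here; the helper name utf8_length and the sample characters are just assumptions chosen for the demo). It encodes one code point of each length with the standard library codec:

```python
# One sample code point for each possible UTF-8 length (1 to 4 bytes).
samples = [
    "A",           # U+0041, ASCII
    "\u00e9",      # U+00E9, LATIN SMALL LETTER E WITH ACUTE
    "\u20ac",      # U+20AC, EURO SIGN
    "\U0001F600",  # U+1F600, GRINNING FACE (Plane 1, outside the BMP)
]


def utf8_length(ch: str) -> int:
    """Return how many bytes the single code point `ch` needs in UTF-8."""
    return len(ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s): {encoded.hex()}")
```

The last sample lives in Plane 1, and the same codec handles it without any special casing.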

Now when it comes to UTF-16, many developers are lazy and think that the
BMP (Plane 0) is more than enough and covers all spoken languages - so
they implement UTF-16 as fixed-width 2-byte code units only. That is just
WRONG, because now Planes 1-16 (which need surrogate pairs, i.e. 4 bytes
in UTF-16) are not supported. UCS-2 (what MSEgui supports) is just
Unicode Plane 0.
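
A rough Python sketch of the difference (the function name utf16_units is hypothetical, picked only for this example): a BMP character fits in one 16-bit code unit, while anything from Planes 1-16 needs a surrogate pair, which a 2-byte-only implementation cannot represent.

```python
def utf16_units(ch: str) -> list:
    """Return the 16-bit code units that encode `ch` in UTF-16."""
    data = ch.encode("utf-16-be")  # big-endian, no BOM
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


bmp_char = "\u20ac"         # EURO SIGN, Plane 0
astral_char = "\U0001F600"  # GRINNING FACE, Plane 1

for ch in (bmp_char, astral_char):
    units = utf16_units(ch)
    print(f"U+{ord(ch):04X}: {len(units)} code unit(s)", [hex(u) for u in units])

# A surrogate pair is a high surrogate (D800-DBFF) followed by a
# low surrogate (DC00-DFFF).
high, low = utf16_units(astral_char)
print(0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF)
```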


Decomposed characters are a totally different subject, and affect UTF-8,
UTF-16 and UTF-32 alike. This topic is what probably confuses most people.
They think that because you see one "character" on the screen, it must
be one code point. That assumption is simply not true. Different OSes
handle this differently. I believe (not 100% sure) that Mac OS X always
decomposes "characters" on the file system level. I believe Linux is the
opposite. No idea what Windows does.
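
For anyone who wants to see the "one character on screen, several code points" effect, here is a minimal sketch using Python's standard unicodedata module (the variable names are mine; this says nothing about what any particular file system does):

```python
import unicodedata

composed = "\u00e9"                                  # e-acute as ONE code point
decomposed = unicodedata.normalize("NFD", composed)  # "e" + U+0301 combining acute

# Both render as the same visible character, but they differ in length
# and do not compare equal as raw strings.
print(len(composed), len(decomposed))
print(composed == decomposed)

# The extra code point costs bytes in every encoding, not only UTF-8.
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    print(codec, len(composed.encode(codec)), len(decomposed.encode(codec)))

# Normalizing both sides to the same form makes the comparison reliable.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

So any code that compares or counts "characters" has to normalize first, whichever encoding the framework uses internally.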


>> and no worries about Little Endian or 
>> Big Endian (which is also to be considered when using UTF-16).
>>
> In process space, really? Please explain.

If one makes the claim that a framework fully supports UTF-16, then that
framework must also be able to handle UTF-16 encoded files (even though
they are rare). In that case, such a framework must also take into
account the type of UTF-16 encoding, which could be LE or BE.

With UTF-8, there is just one type of encoding, and you don't need to
worry about endianness.
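
As an illustration only (the helper sniff_utf16 is a hypothetical name, and real file handling would need more care than this), the following Python sketch shows the two UTF-16 byte orders next to the single UTF-8 form, plus a simple byte order mark check:

```python
import codecs

text = "A\u20ac"  # "A" followed by EURO SIGN

utf16_le = text.encode("utf-16-le")  # low byte first
utf16_be = text.encode("utf-16-be")  # high byte first
utf8 = text.encode("utf-8")          # one byte sequence, no byte order

print(utf16_le.hex(), utf16_be.hex(), utf8.hex())


def sniff_utf16(data: bytes) -> str:
    """Guess the UTF-16 byte order from a leading byte order mark (BOM)."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return "unknown"  # no BOM: the reader has to be told or has to guess


print(sniff_utf16(codecs.BOM_UTF16_LE + utf16_le))
print(sniff_utf16(codecs.BOM_UTF16_BE + utf16_be))
print(sniff_utf16(utf16_le))
```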


> MSEgui does not use utf-16 on disk 
> or over the wire.

Which brings up another advantage of UTF-8. UTF-8 is probably the most
common on-disk storage [Unicode] encoding. UTF-8 encoded data is already
in a byte format, so it is ideal for on-disk storage or streaming (over
the wire). A UTF-16 framework must convert to and from UTF-8 for on-disk
storage.


Anyway, I didn't want to get into this whole UTF-8 vs UTF-16 debate.
There never are winners. But blatantly stating that UTF-8 is multitudes
slower than UTF-16 is just so wrong. And that is the statement I wanted
to correct with my reply.

Regards,
  - Graeme -



_______________________________________________
mseide-msegui-talk mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/mseide-msegui-talk