From: Marcus Bointon <[EMAIL PROTECTED]>
Date: Thu, 3 Aug 2006 13:17:42 +0100
> I may be wrong here, but I'm fairly sure that the dominant Unicode
> library (IBM's ICU) is centred around UTF-16.
ElfData isn't too bad at Unicode either. It's maybe nowhere near as
rich as ICU, but it still does a lot, including NFD and NFC.
Also, I do NFD and NFC on UTF-8, directly. I'd been told over and
over that this wasn't possible. Before I even wrote the code I knew
it was possible, and that it would be fast and simple (for me) to
implement. They still insisted it couldn't be done, so I went and
built it and showed them. Then they shut up :)
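To give a concrete picture of what that means at the byte level (this
is just an illustration, not the plugin's actual code): the same
character can arrive composed (NFC) or decomposed (NFD), and
normalizing UTF-8 directly is just a matter of rewriting one byte
sequence as the other.

  #include <stdio.h>

  /* e-acute composed (NFC) is the single code point U+00E9, two UTF-8
     bytes; decomposed (NFD) it is U+0065 + U+0301 (e + combining
     acute), three UTF-8 bytes. */
  int main(void)
  {
      const unsigned char nfc[] = { 0xC3, 0xA9 };        /* U+00E9        */
      const unsigned char nfd[] = { 0x65, 0xCC, 0x81 };  /* U+0065 U+0301 */

      printf("NFC: %02X %02X\n", nfc[0], nfc[1]);
      printf("NFD: %02X %02X %02X\n", nfd[0], nfd[1], nfd[2]);
      return 0;
  }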
> That sounds like a good reason for using it. Generally I've got the
> impression that UTF-8 is much better for web use as it's more
> space-efficient, but it's also apparently slower to process than
> UTF-16, which would explain the choice in a library.
Not necessarily. I haven't seen any evidence that processing it is
slower, and because of its compactness it could even be quicker. The
fact that we don't have to interconvert from UTF-8 to UTF-16 and back
also speeds things up.
UTF-16 has endian issues too, which UTF-8 does not.
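Just to make that concrete with a toy example: 'A' is U+0041, and its
bytes look like this in each encoding, which is why UTF-16 streams
need a BOM or an explicit BE/LE label while UTF-8 needs neither.

  #include <stdio.h>

  /* 'A' (U+0041): one byte in UTF-8, two in UTF-16, and in UTF-16 the
     byte order depends on the platform. */
  int main(void)
  {
      const unsigned char utf8[]    = { 0x41 };
      const unsigned char utf16be[] = { 0x00, 0x41 };
      const unsigned char utf16le[] = { 0x41, 0x00 };

      printf("UTF-8:    %02X\n",      utf8[0]);
      printf("UTF-16BE: %02X %02X\n", utf16be[0], utf16be[1]);
      printf("UTF-16LE: %02X %02X\n", utf16le[0], utf16le[1]);
      return 0;
  }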
And you can detect very reliably whether text is valid UTF-8, even
without a BOM; I have such a detection function in my ElfData plugin.
You can't reliably detect UTF-16 without a BOM, unfortunately.
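I won't paste the plugin's function here, but the core idea behind
UTF-8 detection is simple, because the byte patterns are
self-describing. A bare-bones structural check looks roughly like
this (sketch only; a real checker should also reject overlong forms
and surrogate code points):

  #include <stddef.h>

  /* Returns 1 if buf[0..len) is structurally valid UTF-8, else 0.
     Checks lead-byte classes and continuation bytes only. */
  static int looks_like_utf8(const unsigned char *buf, size_t len)
  {
      size_t i = 0;
      while (i < len) {
          unsigned char c = buf[i];
          size_t extra;

          if      (c < 0x80)           extra = 0;  /* ASCII       */
          else if ((c & 0xE0) == 0xC0) extra = 1;  /* 2-byte lead */
          else if ((c & 0xF0) == 0xE0) extra = 2;  /* 3-byte lead */
          else if ((c & 0xF8) == 0xF0) extra = 3;  /* 4-byte lead */
          else                         return 0;   /* stray byte  */

          if (extra > len - 1 - i)     return 0;   /* truncated   */
          for (size_t k = 1; k <= extra; k++)
              if ((buf[i + k] & 0xC0) != 0x80) return 0;
          i += extra + 1;
      }
      return 1;
  }

Random binary data, or Latin-1 text with accented characters, almost
never passes those checks, which is why detection without a BOM works
so well. With UTF-16 there's nothing comparable to test: almost any
even-length byte string decodes to *some* sequence of code units.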
> I know that Valentina went UTF-16 for precisely this reason.
Could be a mistake :( Is he processing full code points? If he is,
then the variable width of UTF-16 kills off its advantage over UTF-8.
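The reason UTF-16 is variable-width: anything above U+FFFF takes a
surrogate pair, so code that works on real code points can't just
index 16-bit units; it has to do something like this (illustrative
sketch, assumes well-formed input):

  #include <stddef.h>
  #include <stdint.h>

  /* Read one code point from a UTF-16 code-unit array, advancing *pos
     by 1 unit (BMP) or 2 units (surrogate pair). */
  static uint32_t utf16_next(const uint16_t *units, size_t *pos)
  {
      uint16_t hi = units[(*pos)++];
      if (hi >= 0xD800 && hi <= 0xDBFF) {          /* high surrogate */
          uint16_t lo = units[(*pos)++];           /* low surrogate  */
          return 0x10000u
               + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
      }
      return hi;                                   /* BMP code point */
  }

So UTF-16 ends up with the same "variable number of units per code
point" property it's supposed to save you from.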
RB's regex requires UTF-8, btw. If UTF-16 is so much easier, then why
is it using UTF-8?
--
http://elfdata.com/plugin/