Re: UTF-16

Randy Abernethy Thu, 31 Dec 2015 11:43:37 -0800

Hey Jens,

I would vote to keep Thrift simple and standardized on UTF-8 alone. The
simple part is the main thing for me.

-Randy

TL;DR

In my experience many lament the 16-bit choice once made. Originally 16-bit
Unicode (UCS-2) had no surrogates (as you mention); it was thought all of
the important characters could fit into 16 bits. Fixed-size characters make
strings easy to navigate (char 5 is at bytes 9/10). When fixed size went
out the window, what was supposed to be a larger but simple fixed-size
character standard turned into the worst of both worlds: variable character
size (5 chars != 10 bytes) and 2 bytes minimum storage (even if all of your
characters only use bits 1-7), doubling string sizes for many applications
and complicating parsing. 16-bit chars also introduced the importance of
byte ordering, BOMs in files, on the wire, etc.
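
A quick Python sketch of the costs described above. The sample strings are
made up for illustration, and nothing here is Thrift code:

```python
# ASCII-only text: one byte per char in UTF-8, two in UTF-16.
text = "hello"
sizes = (len(text.encode("utf-8")), len(text.encode("utf-16-le")))

# A char outside the 16-bit range (U+1D11E, musical G clef) needs a
# surrogate pair, so 5 chars no longer means 10 bytes.
five = "ab\U0001D11Ecd"
five_chars = len(five)                     # code points
five_bytes = len(five.encode("utf-16-le"))

# The same char has two possible byte layouts, hence BOMs.
le_bytes = "A".encode("utf-16-le")
be_bytes = "A".encode("utf-16-be")

print(sizes, five_chars, five_bytes, le_bytes, be_bytes)
```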

Many platforms based on 16-bit characters went that way because at the time
they were created (early 90s) it appeared that 16 bits would be the new
global standard character scheme (ISO and the Unicode Consortium were both
going this way). Windows (NT) and Java were born in this era and both use
16-bit characters as their native scheme.
https://docs.oracle.com/javase/tutorial/i18n/text/unicode.html

Given hindsight I would suspect most would use UTF-8. UTF-8 came up in the
Unix community (X/Open etc.) and was first implemented by Pike and Thompson
(the Go guys, i.e. Go uses UTF-8 as its native string encoding). UTF-8 can
store chars in 8 bits when viable, uses multi-byte sequences (up to 4 bytes,
no surrogates needed) to cover the needs of the globe in other cases and has
no byte ordering issues. UTF-8 strings are self-synchronizing (you can jump
in anywhere, find the next char boundary and start reading chars), can be
sorted as unsigned bytes to produce the same order as sorting by code point,
can be IDed without a BOM, etc. Everything about UTF-16 is equal or worse,
with the exception of the fact that some chars are 16 bits in UTF-16 and
24 bits in UTF-8.
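
Two of these properties can be checked in a few lines of Python. This is a
minimal sketch: the helper name next_boundary and the sample strings are
hypothetical, chosen only for illustration.

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Skip UTF-8 continuation bytes (10xxxxxx) to reach a char start."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

# Self-synchronization: land in the middle of a 4-byte char and recover.
data = "a\u00e9\U0001D11Eb".encode("utf-8")
resync = next_boundary(data, 5)        # offset 5 is inside U+1D11E
tail = data[resync:].decode("utf-8")

# Sort order: Python compares str values by code point.
words = ["\uFF5E", "\U0001D11E", "z", "\u00e9"]
by_code_point = sorted(words)
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
by_utf16_bytes = sorted(words, key=lambda s: s.encode("utf-16-be"))

print(tail, by_utf8_bytes == by_code_point, by_utf16_bytes == by_code_point)
```

The UTF-16 byte order differs here because surrogates (0xD800-0xDFFF) sort
below the top of the 16-bit range even though they encode higher code points.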

It is hard to know which format is used the most. Microsoft unsurprisingly
says UTF-16 (the standard built into Windows, its NT API and .Net) in your
link:
https://msdn.microsoft.com/en-us/library/windows/desktop/dd374081(v=vs.85).aspx

Python (which uses UTF-8) people say UTF-8 is more common:
https://docs.python.org/2/howto/unicode.html

Per Wikipedia, UTF-8 is the dominant character encoding for the web,
accounting for 85.1% of all Web pages in September 2015. The Internet Mail
Consortium (IMC) recommends that all e-mail programs be able to display and
create mail using UTF-8, and the W3C recommends UTF-8 as the default
encoding in XML and HTML. https://en.wikipedia.org/wiki/UTF-8

I have never had much trouble converting from UTF-16 to UTF-8 when need be,
largely due to the growing ubiquity of UTF-8. The Windows API even offers
an MBCS version of all calls with string params. I could see serialization
benefits to adding UTF-16, but byte ordering would be a concern to address
if any real gains are to be had. Wouldn't want to swap everything and then
unswap it. All in all I think UTF-8 is a great wire format for Thrift chars.
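
To make the byte ordering concern concrete, here is a rough Python sketch of
a UTF-16 to UTF-8 conversion. The function name utf16_to_utf8 and its
default_order parameter are hypothetical, and this is not how Thrift handles
strings; it only shows that the reader has to know or guess the byte order.

```python
import codecs

def utf16_to_utf8(data: bytes, default_order: str = "little") -> bytes:
    """Convert UTF-16 bytes to UTF-8, honoring a leading BOM if present."""
    if data.startswith(codecs.BOM_UTF16_LE):
        text = data[2:].decode("utf-16-le")
    elif data.startswith(codecs.BOM_UTF16_BE):
        text = data[2:].decode("utf-16-be")
    else:
        # No BOM: the byte order is an assumption agreed on out of band.
        codec = "utf-16-le" if default_order == "little" else "utf-16-be"
        text = data.decode(codec)
    return text.encode("utf-8")

with_bom = codecs.BOM_UTF16_BE + "h\u00e9llo".encode("utf-16-be")
converted = utf16_to_utf8(with_bom)

# Without a BOM, guessing the wrong order silently yields different text.
wrong_guess = utf16_to_utf8("A".encode("utf-16-be"), default_order="little")

print(converted, wrong_guess)
```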

On Thu, Dec 31, 2015 at 3:32 AM, Jens Geyer <[email protected]> wrote:

> Hi all,
>
> while UTF-8 is great, especially on Windows platforms UTF-16 is more
> common, because the OS uses it heavily internally. Since Win2k it also
> supports surrogates and supplementary characters. So there’s OS support for
> it. What I don’t know is, how universally is UTF-16 (or a subset of it)
> supported across other platforms? Can we assume a certain degree of support
> on all the various platforms that Thrift can run on?
>
>
> TL;DR: Would it make sense to add UTF-16 as another string format type?
>
> Have fun,
> JensG
>
>
> Unicode in Windows
>
> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374081(v=vs.85).aspx
>
> Surrogates and Supplementary Characters
>
> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374069(v=vs.85).aspx
>
