Hey Jens,
I would vote to keep Thrift simple and standardized on UTF-8 alone. The simplicity is the main thing for me. -Randy

TL;DR: In my experience many lament the 16-bit choice once made. Originally 16-bit Unicode (UCS-2) had no surrogates (as you mention); it was thought all of the important characters could fit into 16 bits. Fixed-size characters make strings easy to navigate (char 5 is at bytes 9/10). When fixed size went out the window, what was supposed to be a larger but simple fixed-size character standard turned into the worst of both worlds: variable character size (5 chars != 10 bytes) plus a 2-byte minimum per character (even if all of your characters only use bits 1-7), doubling string sizes for many applications and complicating parsing. 16-bit chars also introduced the importance of byte ordering: BOMs in files, on the wire, etc.

Many platforms based on 16-bit characters went that way because at the time they were created (early 90s) it appeared that 16 bits would be the new global standard character scheme (ISO and the Unicode Consortium were both heading that way). Windows (NT) and Java were born in this era and both use 16-bit characters as their native scheme:
https://docs.oracle.com/javase/tutorial/i18n/text/unicode.html

Given hindsight, I suspect most would now choose UTF-8. UTF-8 came out of the Unix community (X/Open etc.) and was first implemented by Pike and Thompson (the Go guys, i.e. Go uses UTF-8 as its native char encoding). UTF-8 stores chars in 8 bits when viable, uses longer multi-byte sequences (two to four bytes, not surrogates) to cover the needs of the globe in other cases, and has no byte-ordering issues. UTF-8 strings are self-synchronizing (you can jump in anywhere and find the start of the next char), can be sorted as unsigned bytes to produce the same order as if the chars were decoded and sorted by code point, can be identified without a BOM, etc. Everything about UTF-16 is equal or worse, with the exception that some chars are 16 bits in UTF-16 and 24 bits in UTF-8.
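For what it's worth, here is a quick sketch (Python 3, purely illustrative, not Thrift code) of the UTF-8 properties above: variable length with 1-byte ASCII, byte-wise sorting matching code point order, and self-synchronization via continuation bytes:

```python
# Variable length: 1 byte for ASCII, up to 4 bytes for other characters.
assert len("A".encode("utf-8")) == 1
assert len("é".encode("utf-8")) == 2   # U+00E9
assert len("€".encode("utf-8")) == 3   # U+20AC
assert len("😀".encode("utf-8")) == 4  # U+1F600, supplementary plane

# Sorting the raw UTF-8 bytes as unsigned values gives the same order as
# decoding the strings and sorting by code point.
words = ["zebra", "café", "Ωmega", "日本"]
assert sorted(words) == sorted(words, key=lambda w: w.encode("utf-8"))

# Self-synchronizing: continuation bytes always look like 0b10xxxxxx, so
# from any byte offset you can scan back to the start of the character.
data = "héllo".encode("utf-8")         # b'h\xc3\xa9llo'
i = 2                                  # lands mid-character, inside "é"
while i > 0 and (data[i] & 0xC0) == 0x80:
    i -= 1                             # skip continuation bytes
assert data[i:i + 2].decode("utf-8") == "é"
```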
It is hard to know which format is used the most. Microsoft unsurprisingly says UTF-16 (the standard built into Windows, its NT API and .Net) in your link:
https://msdn.microsoft.com/en-us/library/windows/desktop/dd374081(v=vs.85).aspx

The Python docs say UTF-8 is more common:
https://docs.python.org/2/howto/unicode.html

Per Wikipedia, UTF-8 is the dominant character encoding for the web, accounting for 85.1% of all web pages in September 2015. The Internet Mail Consortium (IMC) recommends that all e-mail programs be able to display and create mail using UTF-8, and the W3C recommends UTF-8 as the default encoding in XML and HTML:
https://en.wikipedia.org/wiki/UTF-8

I have never had much trouble converting from UTF-16 to UTF-8 when need be, largely due to the growing ubiquity of UTF-8. The Windows API even offers ANSI/MBCS ("A") versions of all calls with string params. I could see serialization benefits to adding UTF-16, but byte ordering would be a concern to address if any real gains are to be had; you wouldn't want to swap everything and then unswap it.

All in all, I think UTF-8 is a great wire format for Thrift chars.

On Thu, Dec 31, 2015 at 3:32 AM, Jens Geyer <[email protected]> wrote:
> Hi all,
>
> while UTF-8 is great, especially on Windows platforms UTF-16 is more
> common, because the OS uses it heavily internally. Since Win2k it also
> supports surrogates and supplementary characters. So there’s OS support for
> it. What I don’t know is, how universally is UTF-16 (or a subset of it)
> supported across other platforms? Can we assume a certain degree of support
> on all the various platforms that Thrift can run on?
>
>
> TL;DR: Would it make sense to add UTF-16 as another string format type?
>
> Have fun,
> JensG
>
>
> Unicode in Windows
>
> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374081(v=vs.85).aspx
>
> Surrogates and Supplementary Characters
>
> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374069(v=vs.85).aspx
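P.S. A quick sketch (Python 3, illustrative only) of the byte-ordering concern with a UTF-16 wire format that UTF-8 simply doesn't have:

```python
s = "Hi"
le = s.encode("utf-16-le")             # b'H\x00i\x00'
be = s.encode("utf-16-be")             # b'\x00H\x00i'
assert le != be                        # same text, two incompatible layouts

bom = s.encode("utf-16")               # platform byte order, BOM prepended
assert bom[:2] in (b"\xff\xfe", b"\xfe\xff")

assert s.encode("utf-8") == b"Hi"      # UTF-8: one layout, no BOM needed

# And UTF-16 is variable-width too: supplementary characters take a
# surrogate pair (4 bytes), so it buys no fixed-size indexing in return.
assert len("\U0001F600".encode("utf-16-le")) == 4
```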
