@hrajagers brings up encoding -- always a challenge!

Any reason we can't say that ALL "things" of the string type are utf-8?
Period, end of story.
I'd also love to say that all CHAR data is ASCII (or maybe latin-1) -- if you
need Unicode, use a string.

The odds are very good that if you are dealing with software that can only 
handle char, it isn't going to handle Unicode well anyway.

**Reasoning:**

For "over the wire" encoding, utf-8 is the best choice, and has become a
de facto standard (and an actual standard for, say, JSON). Lots of people
think "utf-8" == "Unicode" -- they are wrong, but if we always use utf-8,
then people and tools that handle Unicode properly will work well, and tools
and people that don't will still mostly work.

See: http://utf8everywhere.org/ for a strong opinion. Personally, I think they 
are wrong about "in memory", but their arguments do apply to "on disk" or "over 
the wire" -- essentially any interchange situation.

As for ASCII for CHAR -- the char type (at least in arrays) has to be fixed
length, and utf-8 is not a fixed-length encoding: 10 "characters" may
require 10 or more bytes to store. And if a string is truncated naively, it
could result in an invalid byte sequence.
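A quick Python sketch of that truncation hazard (the word `"naïve"` is an arbitrary example):

```python
# Truncating utf-8 bytes at a fixed length can split a multi-byte
# character, leaving bytes that no longer decode as valid utf-8.
data = "naïve".encode("utf-8")  # 6 bytes for 5 characters ('ï' is 2 bytes)
truncated = data[:3]            # cuts through the middle of 'ï'
try:
    truncated.decode("utf-8")
except UnicodeDecodeError:
    print("invalid utf-8 after naive truncation")
```

This is exactly what would happen to utf-8 data stuffed into a fixed-length char array that is one byte too short.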

Since netcdf provides a variable-length string type, that's the obvious way
to deal with Unicode.
