I'm getting double messages -- I think we may have a feedback loop between 
GitHub and the list...

But anyway:

> Hmmm. Chris, I think you are implying a problem that does not exist.


I hope that's true. Sorry if I stirred up confusion.

But I was responding to a comment about ASCII vs UTF-8, so ....

I also picked this up in email, so was unsure of the context. I've now gone and 
re-read the issue, and I'm a bit confused about what's still on the table.

But way back, someone wrote:
" two issues: the use of strings, and the encoding. These can be decided 
separately, can't they?"

and there was another one: arrays of strings vs whitespace separated strings.

(I'm also not completely clear on the difference between a char* and a string 
here anyway. Either way, it's a bunch of bytes that need to be interpreted.)
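To illustrate that "bytes that need to be interpreted" point in plain Python 
(nothing CF-specific here, just an illustration):

    # The same raw bytes -- whether they come from a char array or a string
    # type -- only become text once you pick an encoding.
    raw = b"temperature in \xc2\xb0C"   # bytes as written by some producer

    print(raw.decode("utf-8"))    # 'temperature in °C'   (what was intended)
    print(raw.decode("latin-1"))  # 'temperature in Â°C'  (mojibake)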

So I'll just talk about encoding here. A few points:

(I know you all know most of this, and most of it has been stated in this 
thread, but to put it all in one place...)

* Encodings are a nightmare: any place where a pile of bytes could be in more 
than one encoding is a pain in the a$$ for any client software -- think about 
the early days of HTML!

* Being able to use non-ASCII characters is important and unavoidable. We can 
certainly restrict CF names to ASCII, but that's simply not an option for the 
contents of variables or attributes (I don't think anyone is suggesting that 
anyway), and Unicode is the obvious way to support it.
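For example (a minimal sketch assuming the netCDF4-python package; the variable 
and attribute values are made up for illustration):

    from netCDF4 import Dataset

    ds = Dataset("example.nc", "w", format="NETCDF4")
    ds.createDimension("time", None)
    temp = ds.createVariable("temperature", "f4", ("time",))

    # Perfectly ordinary metadata that simply cannot be expressed in ASCII:
    temp.long_name = "Température de l'air à 2 m"
    temp.comment = "Station: Zürich / observer: Müller"
    ds.close()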

So that leaves one open question: what encoding(s) are allowed for a CF 
compliant file?

I'm going to be direct here:

THERE IS NO REASON TO ALLOW MORE THAN ONE ENCODING

It only leads to pain. Period. End of story. If there is one allowed encoding, 
then all CF compliant software has to be able to encode/decode that encoding. 
But ONLY that one! If we allow multiple encodings, then to be fully compliant, 
all software would have to encode/decode a wide range of encodings, and there 
would have to be a way to specify which encoding was used. So all software 
would have to be more complex, and there would be a lot more room for error.
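Roughly what that means for client code (a sketch only; the _Encoding attribute 
name below is just a placeholder for "however the file declares its encoding"):

    def read_text_single_encoding(raw_bytes):
        # One allowed encoding: every reader does exactly this and nothing else.
        return raw_bytes.decode("utf-8")

    def read_text_multiple_encodings(raw_bytes, attrs):
        # Multiple allowed encodings: every reader needs a way to find the
        # declared encoding, a fallback policy, and error handling for
        # encodings it doesn't support.
        declared = attrs.get("_Encoding", "utf-8")   # placeholder attribute name
        try:
            return raw_bytes.decode(declared)
        except LookupError:
            raise ValueError("file declares an unsupported encoding: " + declared)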

If there is only one encoding allowed, then there are really only two options: 

UCS-4: because it handles all of Unicode and is always the same number of 
bytes per code point. A lot more like the old char* days. However, no one wants 
to waste all that disk space, so that leaves:

UTF-8: which is ASCII compatible, handles all of Unicode, and has been almost 
universally adopted by internet exchange formats (those that are sane enough 
to specify a single encoding :-) )

It is also friendly to older software that uses null-terminated char* and the 
like, so even old code will probably not break, even if it does misinterpret 
the non-ASCII bytes. And old software that writes plain ASCII will also work 
fine, as ASCII is valid UTF-8.
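The properties I'm leaning on are easy to check (plain Python again; UTF-32 
used as a stand-in for UCS-4):

    text = "air temperature, °C"

    utf8 = text.encode("utf-8")
    utf32 = text.encode("utf-32-be")   # fixed 4 bytes per code point, no BOM

    # ASCII text is byte-for-byte identical in UTF-8:
    assert "air temperature".encode("ascii") == "air temperature".encode("utf-8")

    # UTF-8 never introduces NUL bytes, so null-terminated C strings survive:
    assert b"\x00" not in utf8

    # Fixed-width UCS-4/UTF-32 costs roughly 4x the space for mostly-ASCII text:
    print(len(utf8), len(utf32))   # 20 76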

All that's a long way of saying:

CF should specify UTF-8 as the only correct encoding for all text, char or 
string, possibly with extra restrictions to ASCII in some contexts (e.g. CF 
names).
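In checker terms that rule is about as simple as it gets (a sketch; the 
ASCII-only option is just the kind of extra restriction I mean for names, not a 
settled rule):

    def check_text(raw_bytes, ascii_only=False):
        """Return the decoded text if it is valid UTF-8 (optionally ASCII-only)."""
        encoding = "ascii" if ascii_only else "utf-8"
        try:
            return raw_bytes.decode(encoding)
        except UnicodeDecodeError as err:
            raise ValueError("text is not valid " + encoding.upper() + ": " + str(err))

    # e.g. attribute contents: UTF-8; names: possibly ASCII-only
    check_text("Température de l'air".encode("utf-8"))
    check_text(b"air_temperature", ascii_only=True)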

If that had already been decided, then sorry for the noise :-)
