I'm getting double messages -- I think we may have a feedback loop between GitHub and the list...
But anyway:

> Hmmm. Chris, I think you are implying a problem that does not exist.

I hope that's true -- sorry if I stirred up confusion. But I was responding to a comment about ASCII vs UTF-8, so.... I also picked this up in email, so I was unsure of the context. I've now gone and re-read the issue, and I'm a bit confused about what's still on the table.

But way back, someone wrote: "two issues: the use of strings, and the encoding. These can be decided separately, can't they?" And there was another one: arrays of strings vs. whitespace-separated strings. (I'm also not completely clear about the difference between a char* and a string anyway. Either way, it's a bunch of bytes that need to be interpreted.)

So I'll just talk about encoding here. A few points (I know you all know most of this, and most of it has been stated in this thread, but to put it all in one place...):

* Encodings are a nightmare: any place where a pile of bytes could be in more than one encoding is a pain in the a$$ for any client software -- think about the earlier days of HTML!

* Being able to use non-ASCII characters is important and unavoidable. We can certainly restrict CF names to ASCII, but it's simply not an option for variables or attributes. (I don't think anyone is suggesting that anyway.) And Unicode is the obvious way to support that.

So that leaves one open question: what encoding(s) are allowed for a CF-compliant file?

I'm going to be direct here: THERE IS NO REASON TO ALLOW MORE THAN ONE ENCODING. It only leads to pain. Period. End of story.

If there is one allowed encoding, then all CF-compliant software will have to be able to encode/decode that encoding -- but ONLY that one! If we allow multiple encodings, then to be fully compliant, all software would have to encode/decode a wide range of encodings, and there would have to be a way to specify the encoding. So all software would have to be more complex, and there would be a lot more room for error.
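To make the "more than one encoding" pain concrete, here's a minimal Python sketch (my own illustration, not anything from the CF spec): the same bytes decode "successfully" under two different encodings, and nothing in the bytes themselves tells a reader which interpretation was intended.

```python
# The same byte sequence, interpreted under two different encodings.
# Both decodes "succeed" -- one is just silently wrong (mojibake).

data = "température".encode("utf-8")  # 'é' becomes two bytes in UTF-8

as_utf8 = data.decode("utf-8")
as_latin1 = data.decode("latin-1")  # no error raised, wrong result

print(as_utf8)    # température
print(as_latin1)  # tempÃ©rature

# Without out-of-band encoding metadata, a client cannot tell which
# decoding was intended -- this is the ambiguity a single mandated
# encoding removes.
assert as_utf8 != as_latin1
```

This is exactly the HTML-in-the-early-days problem: readers had to guess, and guessing wrong corrupts text without raising any error.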
If there is only one encoding allowed, then there are really only two options:

* UCS-4: it handles all of Unicode and is always the same number of bytes per code point -- a lot more like the old char* days. However, no one wants to waste all that disk space, so that leaves:

* UTF-8: it is ASCII-compatible, handles all of Unicode, and has been almost universally adopted in internet exchange formats (those that are sane enough to specify a single encoding :-) ). It is also friendly to older software that uses null-terminated char* and the like, so even old code will probably not break, even if it does misinterpret the non-ASCII bytes. And old software that writes plain ASCII will also work fine, since ASCII is valid UTF-8.

All that's a long way of saying: CF should specify UTF-8 as the only correct encoding for all text, char or string, with possibly some extra restrictions to ASCII in some contexts.

If that had already been decided, then sorry for the noise :-)

--
You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599001824
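The two UTF-8 properties claimed above (ASCII compatibility, and safety for null-terminated char* code) can be checked directly. A small sketch, using a hypothetical CF-style name and an arbitrary non-ASCII string as examples:

```python
# Property 1: pure ASCII text is already valid UTF-8, byte for byte,
# so old software that writes plain ASCII produces valid UTF-8.
ascii_name = "sea_surface_temperature"
assert ascii_name.encode("ascii") == ascii_name.encode("utf-8")

# Property 2: UTF-8 never produces a 0x00 byte except for the NUL
# character itself, so null-terminated C strings pass through intact.
unicode_text = "Tøyen 気温"
encoded = unicode_text.encode("utf-8")
assert 0x00 not in encoded

# Every byte of a multi-byte UTF-8 sequence has the high bit set, so
# legacy char* code never confuses it with an ASCII byte -- it may
# misinterpret the characters, but it won't truncate or break.
assert all(b >= 0x80 for b in "ø".encode("utf-8"))

# And the round trip is lossless.
assert encoded.decode("utf-8") == unicode_text
```

These invariants are why UTF-8 degrades gracefully in old code where UTF-16 or UCS-4 (which embed 0x00 bytes freely) would not.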