Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
> UTF-8 is only an encoding, so we should just say "unicode" for strings.

We could do that if and only if netcdf itself was clear about how Unicode is encoded in files. It is for variable names, though I'm not so sure it is anywhere else. But even so, once the encoding has been specified, then yes, talking about Unicode makes sense.

Agreed, it's not for this discussion, but: `MUTF8` is not quite (in that doc) "any unicode string encoded as normalized UTF-8", because I think they are specifically trying to exclude the ASCII subset so they can handle that separately, i.e. characters that are excluded, like "/", are indeed unicode strings. It's a pretty contorted way to describe it -- but that's netcdf's problem :-)

--
You are receiving this because you commented.
Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-600128492

This list forwards relevant notifications from Github. It is distinct from cf-metad...@cgd.ucar.edu, although if you do nothing, a subscription to the UCAR list will result in a subscription to this list. To unsubscribe from this list only, send a message to cf-metadata-unsubscribe-requ...@listserv.llnl.gov.
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
I agree and would go one small step further: UTF-8 is only an encoding, so we should just say "unicode" for strings. If we need to restrict that, say to disallow an underscore at the beginning, or to reserve a separator character like space in attributes, we should do so at the character level, possibly using categories as introduced by @ChrisBarker-NOAA above.

--
Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-600067627
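To make the idea of character-level restrictions concrete, here is a minimal sketch using Unicode general categories from Python's standard `unicodedata` module. The function name `name_problems` and the particular rules (no leading underscore, no separator or "other" category characters) are assumptions chosen for illustration; nothing in CF specifies them.

```python
import unicodedata


def name_problems(name):
    """Return a list of rule violations for a candidate string.

    The rules are hypothetical: they only illustrate restricting strings
    by Unicode general category instead of by byte values.
    """
    problems = []
    # Hypothetical rule 1: no underscore as the first character.
    if name.startswith("_"):
        problems.append("leading underscore")
    for ch in name:
        category = unicodedata.category(ch)
        if category.startswith("Z"):
            # Zs, Zl, Zp: space, line and paragraph separators.
            problems.append(f"separator U+{ord(ch):04X}")
        elif category.startswith("C"):
            # Cc, Cf, Cs, Co, Cn: control, format, surrogate,
            # private use, unassigned.
            problems.append(f"control or unassigned U+{ord(ch):04X}")
    return problems
```

Because the check works on decoded characters, it behaves the same whatever encoding the file uses on disk; for example, a no-break space (U+00A0) is caught as a separator just like an ASCII space.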
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
@DocOtak @zklaus Sorry if I've pulled the discussion off track. The question of exactly why NUG worded things the way they did is intriguing, but I think Klaus is right that we shouldn't get wrapped around that particular axle in this issue, particularly if we are going to split encoding off into a different issue.

I think the take-away is that our baseline is "sane utf-8 unicode" for attributes of type NC_STRING and ASCII for attributes of type NC_CHAR (those created with the C function nc_put_att_text).

--
Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-600065075
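As a rough illustration of that baseline, the sketch below checks a Python string against the two cases. The function name `fits_baseline` is hypothetical, and reading "sane" as "can be encoded as UTF-8" (which rules out lone surrogates) is an assumption; the thread does not define the term.

```python
def fits_baseline(value, nc_type):
    """Check an attribute value against the proposed baseline.

    Assumption: "sane" UTF-8 means only that the string can be encoded
    as UTF-8. No normalization check is attempted here.
    """
    if nc_type == "NC_CHAR":
        # Baseline for char attributes: plain 7-bit ASCII.
        return value.isascii()
    if nc_type == "NC_STRING":
        try:
            value.encode("utf-8")
        except UnicodeEncodeError:
            # Raised, for example, for unpaired surrogate code points.
            return False
        return True
    raise ValueError(f"unsupported attribute type: {nc_type}")
```

This does not touch a netCDF file at all; it only shows what the two baselines would accept at the character level.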
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
I think there is some confusion here.

First, this whole regex stuff is only about the physical byte layout of the netcdf classic file format. I would in principle suggest focusing completely on netcdf4 files instead.

Second, I think CF should not concern itself with encodings and byte order stuff at all. Leave that to netcdf4/hdf5 and just work at the character level. And yes, unicode has code points, but also a concept of characters (see [here](https://en.wikipedia.org/wiki/Unicode#Architecture_and_terminology)).

Third, looking at the regex in question

```
([a-zA-Z0-9_]|{MUTF8})([^\x00-\x1F/\x7F-\xFF]|{MUTF8})*
```

notice that it is only an explanatory comment, but apart from that the overwhelmingly likely way to parse this, thanks to the "|" alternatives, is as either

```
([a-zA-Z0-9_])([^\x00-\x1F/\x7F-\xFF])*
```

i.e. an ascii string starting with a letter, digit, or underscore, limited to the first 128 byte values, without control characters and excluding "/" everywhere, or

```
({MUTF8})({MUTF8})*
```

i.e. *any* unicode string encoded as normalized UTF-8.

--
Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599957114
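One way to check how the full pattern behaves is to hand it to a regex engine. The sketch below transcribes it into a Python bytes pattern. The expansion of `{MUTF8}` is an assumption: it is taken here to mean one multi-byte UTF-8 sequence, written in a simplified form that does not reject overlong encodings or surrogates, and the helper name `matches` is hypothetical.

```python
import re

# Assumed stand-in for {MUTF8}: a single multi-byte UTF-8 sequence
# (2, 3 or 4 bytes). Simplified: overlong forms are not excluded.
MUTF8 = (
    rb"(?:[\xC2-\xDF][\x80-\xBF]"
    rb"|[\xE0-\xEF][\x80-\xBF]{2}"
    rb"|[\xF0-\xF4][\x80-\xBF]{3})"
)

# ([a-zA-Z0-9_]|{MUTF8})([^\x00-\x1F/\x7F-\xFF]|{MUTF8})*
NAME = re.compile(
    rb"(?:[a-zA-Z0-9_]|" + MUTF8 + rb")"
    rb"(?:[^\x00-\x1F/\x7F-\xFF]|" + MUTF8 + rb")*"
)


def matches(name):
    """Return True if the UTF-8 bytes of name match the whole pattern."""
    return NAME.fullmatch(name.encode("utf-8")) is not None
```

Under these assumptions the engine applies each alternation per character, so besides pure-ASCII names such as `air_temperature` and pure non-ASCII names, a mixed name such as `température` also matches, while `a/b` and names containing control characters do not.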