Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-17 Thread Chris Barker
> UTF-8 is only an encoding, so we should just say "unicode" for strings.

We could do that if and only if netcdf itself were clear about how Unicode is 
encoded in files. It is for variable names, though I'm not so sure it is 
anywhere else.

But even so, once the encoding has been specified, then yes, talking about 
Unicode makes sense. 

Agreed, it's not for this discussion, but:

`MUTF8` (in that doc) is not quite "any unicode string encoded as normalized 
UTF-8", because I think they are specifically trying to exclude the ASCII 
subset so they can handle that separately, i.e. characters that are excluded, 
like "/", are indeed unicode strings.

It's a pretty contorted way to describe it, but that's netcdf's problem :-)




-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-600128492

This list forwards relevant notifications from Github.  It is distinct from 
cf-metad...@cgd.ucar.edu, although if you do nothing, a subscription to the 
UCAR list will result in a subscription to this list.
To unsubscribe from this list only, send a message to 
cf-metadata-unsubscribe-requ...@listserv.llnl.gov.

Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-17 Thread Klaus Zimmermann
I agree and would go one small step further: UTF-8 is only an encoding, so we 
should just say "unicode" for strings. If we need to restrict that, say to 
disallow underscore in the beginning or to save a separation character like 
space in attributes right now, we should do so at the character level, possibly 
using categories as introduced by @ChrisBarker-NOAA above.
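
As a rough illustration of restricting strings at the character level, here is a minimal Python sketch. The rule set (no leading underscore, no control characters, no space separators) and the names `FORBIDDEN_CATEGORIES` and `violations` are hypothetical and chosen only for this example; nothing here is taken from CF or the NUG.

```python
import unicodedata
from typing import List

# Hypothetical rule set, for illustration only; not taken from CF or the NUG.
# "Cc" = control characters, "Zs" = space separators (Unicode general categories).
FORBIDDEN_CATEGORIES = {"Cc", "Zs"}


def violations(value: str) -> List[str]:
    """Return a list of character-level rule violations found in `value`."""
    problems = []
    if value.startswith("_"):
        problems.append("leading underscore")
    for ch in value:
        cat = unicodedata.category(ch)
        if cat in FORBIDDEN_CATEGORIES:
            problems.append(f"U+{ord(ch):04X} has category {cat}")
    return problems
```

Because the check works on decoded characters and their categories, it does not depend on which encoding the file layer uses.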

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-600067627


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-17 Thread JimBiardCics
@DocOtak @zklaus Sorry if I've pulled the discussion off track. The question of 
exactly why NUG worded things the way they did is intriguing, but I think Klaus 
is right that we shouldn't get wrapped around that particular axle in this 
issue — particularly if we are going to split encoding off into a different 
issue. I think the take-away is that our baseline is "sane utf-8 unicode" for 
attributes of type NC_STRING and ASCII for attributes of type NC_CHAR (those 
created with the C function nc_put_att_text).

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-600065075


Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)

2020-03-17 Thread Klaus Zimmermann
I think there is some confusion here.

First, this whole regex stuff is only about the physical byte layout of the 
netcdf classic file format. In principle, I would suggest focusing entirely on 
netcdf4 files instead.

Second, I think CF should not concern itself with encodings and byte order 
stuff at all. Leave that to netcdf4/hdf5 and just work at the character level. 
And yes, unicode has code points, but also a concept of characters (see 
[here](https://en.wikipedia.org/wiki/Unicode#Architecture_and_terminology)).

Third, looking at the regex in question
```
([a-zA-Z0-9_]|{MUTF8})([^\x00-\x1F/\x7F-\xFF]|{MUTF8})*
```
notice that it is only an explanatory comment. Apart from that, the 
overwhelmingly likely way to parse this, thanks to the "|" alternatives, is as 
either
```
([a-zA-Z0-9_])([^\x00-\x1F/\x7F-\xFF])*
```
i.e. an ASCII string starting with a letter, digit, or underscore, limited to 
the first 128 byte values without control characters and excluding "/" 
everywhere, or
```
({MUTF8})({MUTF8})*
```
i.e. *any* unicode string encoded as normalized UTF-8.
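
To make the pattern easier to experiment with, here is a minimal Python sketch that transcribes it to the character level. It rests on assumptions: `{MUTF8}` is taken to mean a single non-ASCII character (one that is multi-byte in UTF-8), the normalization is taken to be NFC, and the check runs on decoded `str` values rather than on raw bytes. The names `NAME_RE` and `is_valid_name` are made up for this example.

```python
import re
import unicodedata

# Assumption: {MUTF8} stands for one non-ASCII character (multi-byte in UTF-8).
MUTF8 = r"[^\x00-\x7F]"

# First character: ASCII letter, digit, underscore, or a non-ASCII character.
FIRST = rf"(?:[a-zA-Z0-9_]|{MUTF8})"

# Later characters: printable ASCII (0x20-0x7E) except "/" (0x2F),
# or a non-ASCII character. This is the character-level equivalent of the
# byte class [^\x00-\x1F/\x7F-\xFF].
REST = rf"(?:[\x20-\x2E\x30-\x7E]|{MUTF8})"

NAME_RE = re.compile(rf"{FIRST}{REST}*")


def is_valid_name(name: str) -> bool:
    """Check a decoded name against the transcribed pattern (sketch only)."""
    # Assumption: "normalized" means NFC; reject anything not already in NFC.
    if unicodedata.normalize("NFC", name) != name:
        return False
    return NAME_RE.fullmatch(name) is not None
```

Read literally, the alternation applies per character, so this transcription also accepts names that mix ASCII and non-ASCII characters, for example `is_valid_name("temp\u00e9rature")`.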

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599957114
