Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
Ah yes, I see what you mean, you are right: Always speaking about UTF-8, multi-byte here isn't referring to the possibility of having several bytes encode one code point, but to actual code points with more than one byte, thus excluding the one-byte code points which are exactly the first 128 ASCII characters. Then they allow back in specific ASCII characters. -- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-600545623 This list forwards relevant notifications from Github. It is distinct from cf-metad...@cgd.ucar.edu, although if you do nothing, a subscription to the UCAR list will result in a subscription to this list. To unsubscribe from this list only, send a message to cf-metadata-unsubscribe-requ...@listserv.llnl.gov.
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
> UTF-8 is only an encoding, so we should just say "unicode" for strings.

We could do that if and only if netcdf itself was clear about how Unicode is encoded in files. Which it is for variable names, though I'm not so sure it is anywhere else. But even so, once the encoding has been specified, then yes, talking about Unicode makes sense.

Agreed, it's not for this discussion, but: `MUTF8` is not quite (in that doc) "any unicode string encoded as normalized UTF-8", because I think they are specifically trying to exclude the ASCII subset, so they can handle that separately, i.e. characters that are excluded, like "/", are indeed unicode strings. But it's a pretty contorted way to describe it -- but that's netcdf's problem :-)

-- Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-600128492
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
I agree and would go one small step further: UTF-8 is only an encoding, so we should just say "unicode" for strings. If we need to restrict that, say to disallow underscore in the beginning or to save a separation character like space in attributes right now, we should do so at the character level, possibly using categories as introduced by @ChrisBarker-NOAA above.

-- Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-600067627
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
@DocOtak @zklaus Sorry if I've pulled the discussion off track. The question of exactly why NUG worded things the way they did is intriguing, but I think Klaus is right that we shouldn't get wrapped around that particular axle in this issue, particularly if we are going to split encoding off into a different issue.

I think the take-away is that our baseline is "sane utf-8 unicode" for attributes of type NC_STRING and ASCII for attributes of type NC_CHAR (those created with the C function `nc_put_att_text`).

-- Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-600065075
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
I think there is some confusion here.

First, this whole regex stuff is only about the physical byte layout of the netcdf classic file format. I would in principle suggest focusing completely on netcdf4 files instead.

Second, I think CF should not concern itself with encodings and byte order stuff at all. Leave that to netcdf4/hdf5 and just work at the character level. And yes, unicode has code points, but also a concept of characters (see [here](https://en.wikipedia.org/wiki/Unicode#Architecture_and_terminology)).

Third, looking at the regex in question
```
([a-zA-Z0-9_]|{MUTF8})([^\x00-\x1F/\x7F-\xFF]|{MUTF8})*
```
notice that it is only an explanatory comment, but apart from that the overwhelmingly likely way to parse this, thanks to the "|" alternatives, is as either
```
([a-zA-Z0-9_])([^\x00-\x1F/\x7F-\xFF])*
```
i.e. an ascii string starting with a letter, digit, or underscore, limited to the first 128 byte values without control characters and excluding "/" everywhere, or
```
({MUTF8})({MUTF8})*
```
i.e. *any* unicode string encoded as normalized UTF-8.

-- Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599957114
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
Remember that utf-8 is ascii compatible for the first 128 values (7 bits). So:

* 0x00 to 0x1F are the control codes from ASCII
* 0x7F is DEL (not sure why that wasn't in the first set, but there you go)
* 0x80 to 0xFF are the non-ascii bytes (128-255), which you have to be able to use in order to do utf-8.

But frankly, I am not sure what a regex means with regard to bytes. But if I had to guess, I'd pull it apart this way (which is almost what's in the footnote):

First, MUTF8 means: "multibyte UTF-8 encoded, NFC-normalized Unicode character". However, Unicode doesn't quite use "characters", but rather "code points", so that means any Unicode code point >= 128 (0x80).

`([a-zA-Z0-9_]|{MUTF8})([^\x00-\x1F/\x7F-\xFF]|{MUTF8})*`

The first character has to be `([a-zA-Z0-9_]|{MUTF8})`: an ASCII letter, digit, or underscore, OR any code point at or above 128.

All the other characters have to be: anything other than \x00-\x1F, "/", and \x7F-\xFF, OR any code point at or above 128.

Which is an odd way to define it, as the code points \x80-\xFF are valid Unicode, so you're kind of excluding them and then allowing them again. Strange. I suspect that this started with the original pre-Unicode definition, and they added the UTF8 part, and got an odd mixture. In particular, there is really no reason to treat the single-byte and multibyte UTF-8 code points separately, that's just odd.

I think I'd write this as:

Names are UTF-8 encoded. The first letter can be any of these code points:
```
x30 - x39  (digits: 0-9)
x41 - x5a  (upper case letters: A-Z)
x61 - x7a  (lower case letters: a-z)
x5f        (underscore)
>= x80
```
```
The rest can be any code point other than: \x00-\x1F or \x7F
```
However, there is a key missing piece: a number of Unicode code points are used for control characters and whitespace, and probably other things unsuitable for names. Which may be why they used the term "character". But it would be better if they had clearly defined what's allowed and what's not.
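A minimal sketch of the reading above, working on decoded characters rather than raw bytes. The helper name `is_valid_name` is hypothetical and this is not the check the netCDF library itself performs; it follows the quoted regex in also excluding "/", and it assumes a name must already be NFC-normalized.

```python
import re
import unicodedata

# First character: ASCII letter, digit, underscore, or any code point >= 0x80.
# Remaining characters: anything except C0 controls (0x00-0x1F), "/" and DEL (0x7F).
_NAME_RE = re.compile(r"[A-Za-z0-9_\u0080-\U0010FFFF][^\x00-\x1F/\x7F]*")


def is_valid_name(name: str) -> bool:
    """Return True if `name` is NFC-normalized and fits the pattern above."""
    if unicodedata.normalize("NFC", name) != name:
        return False
    return _NAME_RE.fullmatch(name) is not None
```

Under these assumptions `is_valid_name("π_field")` is accepted and `is_valid_name("a/b")` is rejected. A space after the first character is accepted too, which illustrates the "missing piece" about whitespace and other unsuitable code points.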
For instance, Python3 uses these categories (https://docs.python.org/3/reference/lexical_analysis.html#identifiers):

* Lu - uppercase letters
* Ll - lowercase letters
* Lt - titlecase letters
* Lm - modifier letters
* Lo - other letters
* Nl - letter numbers

I have no idea if those are defined by the Unicode consortium anywhere. But it would be good for netcdf (and/or CF) to define it for themselves.

I will say that it's kind of nifty to be able to do (in Python):
```
In [17]: π = math.pi
In [18]: area = π * r**2
```
But I'm not sure I need to be able to assign a variable to -- which Python will not allow, but does the netcdf spec allow it?

-- Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599804396
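As a rough illustration of restricting names by category: the two-letter codes listed above are Unicode general categories, which Python exposes through `unicodedata.category`. The helper `starts_like_identifier` below is hypothetical and only approximates Python's own rule (which also admits a few extra code points); `str.isidentifier` is the built-in check.

```python
import unicodedata

# Categories the Python 3 reference lists for the start of an identifier
# (Python additionally allows "_" and a few Other_ID_Start characters).
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}


def starts_like_identifier(ch: str) -> bool:
    """Approximate test: may the single character `ch` begin an identifier?"""
    return ch == "_" or unicodedata.category(ch) in ID_START_CATEGORIES


for ch in ["A", "π", "_", "1", " "]:
    print(repr(ch), unicodedata.category(ch), starts_like_identifier(ch))
```

A netcdf or CF rule could be phrased the same way, as a set of allowed categories, instead of as byte ranges.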
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
I missed the regex. Yep, that's what it says. 0x7F is the "del" char, so it's non-printing. I think the characters from 0xC0 - 0xFF are out because they would all be interpreted in UTF-8 as signaling the start of a multi-byte character. 0x80 - 0xBF can all be interpreted as trailing elements of a multibyte character, so I guess it's a bad plan to have one lying around loose. [This Wikipedia article](https://en.wikipedia.org/wiki/UTF-8) was informative.

-- Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599791736
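The byte ranges described above can be checked with a small sketch. `classify_utf8_byte` is a hypothetical helper, and the classification is simplified: a few values in the "lead" range (0xC0, 0xC1, and 0xF5 to 0xFF) never occur in valid UTF-8 at all.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte value by the role it can play in UTF-8."""
    if b < 0x80:
        return "ascii"          # single-byte code point
    if b < 0xC0:
        return "continuation"   # trailing byte of a multi-byte sequence
    return "lead"               # first byte of a multi-byte sequence


encoded = "a°".encode("utf-8")
print([(hex(b), classify_utf8_byte(b)) for b in encoded])
# [('0x61', 'ascii'), ('0xc2', 'lead'), ('0xb0', 'continuation')]
```

A continuation byte on its own, such as `b"\xb0"`, does not decode as UTF-8, which is the "lying around loose" problem.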
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
@JimBiardCics It's the "not match" group in that regex that is doing it `([^\x00-\x1F/\x7F-\xFF]|{MUTF8})`, at least, I'm pretty sure that is what is going on. I rarely use regex myself, so I could be wrong, but I'm quite sure that the `^` is "not match".

-- Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599756995
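To see what the negated class does, here is a small sketch applying the same class to single bytes with Python's `re` module on `bytes` patterns. The names `NOT_EXCLUDED` and `allowed_single_byte` are made up for illustration, and the `{MUTF8}` alternative is left out.

```python
import re

# Same character class as in the quoted group, applied to single bytes.
# The leading "^" inside the brackets negates the class.
NOT_EXCLUDED = re.compile(rb"[^\x00-\x1F/\x7F-\xFF]")


def allowed_single_byte(b: bytes) -> bool:
    """True if the one-byte string `b` is matched by the negated class."""
    return NOT_EXCLUDED.fullmatch(b) is not None


print(allowed_single_byte(b"a"))     # True
print(allowed_single_byte(b"/"))     # False
print(allowed_single_byte(b"\xb0"))  # False: a lone high byte is rejected
```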
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
@DocOtak I couldn't find the direct restriction on the 0x80 to 0xFF characters. Is this a side effect of utf-8 using the high bit to signal multibyte characters? Or is it a more general prohibition against using the characters in latin-1 that fall in that range?

-- Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599755236
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
Thanks! Yup -- then attributes really do need to be UTF-8 and the STRING type (for text) only. I suppose they don't ALL HAVE to be the STRING type, but the ones that might contain variable names should be. After all, any software that doesn't support the STRING type probably doesn't support Unicode variable names, either ...

-- Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599728983
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
Additionally, the netcdf standard itself has support for UTF-8 variable names, requires them to be [NFC](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms), and specifically excludes bytes [0x00 to 0x1F and 0x7F to 0xFF](https://www.unidata.ucar.edu/software/netcdf/docs/file_format_specifications.html) (see the "name" part of that document).

I think this matters because at least one of the [standard attributes](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#ancillary-data) needs to be able to refer to variable names. Basically, allowing anything other than UTF-8, especially things that allow bytes 0x7F to 0xFF (like the ISO-8859 series encodings do), would probably cause actual problems.

-- Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599690108
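A short sketch of why the NFC requirement matters when an attribute has to refer to a variable name: two strings that render identically can differ at the byte level unless both are normalized. The example strings are made up.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

print(composed.encode("utf-8"))    # b'\xc3\xa9'
print(decomposed.encode("utf-8"))  # b'e\xcc\x81'
print(composed == decomposed)      # False: a naive name lookup would fail

# After NFC normalization the two spellings compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```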
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
> Also be aware that LATIN-1 is not compatible with UTF-8 with code points above 127

Indeed. Which is why it should be clear that you should NOT put utf-8 in a CHAR array :-)

We could say ASCII only for CHAR, but I'm not sure there is a good reason to be that restrictive. It may be an implementation detail of the Python encodings, but at least there, latin-1 can decode ANY string of bytes (other than the null byte) without error, and write it out again with no changes. So if consuming code uses the latin-1 encoding for all CHAR arrays, it may get garbage for the non-ascii bytes, but it won't raise an error, or mangle the data if it is written back out.

> the netcdf python library will force the use of strings for netcdf4 files if it sees unicode points outside of ASCII.

Which is the right thing to do, and compatible with this proposal, I think (hmm, unless latin-1 is allowed). But you could probably send a latin-1 encoded bytes object in, yes?

Anyway, if we codify this, and the netCDF4 lib (or any other) can't support it, it can be fixed. And yes, I am volunteering to do a PR for a fix to netCDF4-python.

-- Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599681959
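The round-trip property mentioned above can be sketched as follows. In Python, latin-1 maps each of the 256 byte values to the code point with the same number (as far as the codec itself is concerned this includes byte 0), so decoding never fails and re-encoding restores the original bytes. The sample strings are illustrative only.

```python
raw = bytes(range(256))            # every possible byte value
text = raw.decode("latin-1")       # never raises
assert text.encode("latin-1") == raw  # written back out unchanged

# UTF-8 bytes misread as latin-1 give garbage, but no error and no data loss.
utf8_bytes = "10°".encode("utf-8")
misread = utf8_bytes.decode("latin-1")
print(misread)                                   # 10Â°
print(misread.encode("latin-1") == utf8_bytes)   # True
```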
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
> we should, in this specific issue, propose making changes to the CF document to make it clear that CHAR attributes must be ASCII or latin-1 and STRING attributes should be unicode/utf-8

+1 on that.

-- Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599662220
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
@ChrisBarker-NOAA

* The sub-issues haven't been split off.
* This issue is about attributes only.
* The difference between CHAR and STRING for an attribute is, essentially, which type you pick when you create the attribute. With CHAR you can't specify arrays for the value (because it already is a CHAR array) and with STRING you can.

Assuming we do spin off sub-issues related to encoding and string array attributes, I agree fully that we should, in this specific issue, propose making changes to the CF document to make it clear that CHAR attributes must be ASCII or latin-1 and STRING attributes should be unicode/utf-8.

-- Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599603557
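One way a writer could apply the proposed rule, as a sketch only: `encode_attribute` is a hypothetical helper, not part of any netCDF library, and it returns the type name as a plain string instead of calling a real API. The `allow_latin1` switch reflects the still-open question of whether latin-1 is acceptable for CHAR.

```python
def encode_attribute(text: str, allow_latin1: bool = False):
    """Return (type_name, payload_bytes) for a text attribute value."""
    if text.isascii():
        return "NC_CHAR", text.encode("ascii")
    if allow_latin1:
        try:
            return "NC_CHAR", text.encode("latin-1")
        except UnicodeEncodeError:
            pass  # not representable in latin-1; fall through to STRING
    return "NC_STRING", text.encode("utf-8")


print(encode_attribute("kelvin"))                  # ('NC_CHAR', b'kelvin')
print(encode_attribute("10°"))                     # ('NC_STRING', b'10\xc2\xb0')
print(encode_attribute("10°", allow_latin1=True))  # ('NC_CHAR', b'10\xb0')
```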
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
> My original observation was that we can absolutely split off some of these issues.

Agreed. Have these been started? I can't find them if they have.

There is also the question of what to do with CHAR types -- the same as STRING? And what about encoding of CHAR and STRING variables? I can't find anything about that in the current CF document, so it doesn't seem to be settled.

Maybe this should go in a new issue, but for now, I had a (not well formed) thought:

* CHAR variables and attributes should only be encoded in a 1-byte-per-character ascii-compatible encoding, e.g. ascii or latin-1.
* STRING variables and attributes should only be encoded in utf-8 (of which ascii is a subset).

My justification is that there will be little software in the wild that supports Unicode but does not support String. Setting this standard will make it less likely that older software that assumes a 1-byte-per-character text representation will get handed something it can't deal with. And the string type is better suited to Unicode anyway, as the "length" of a string is less well defined.

-- Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599262170
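To make the point about "length" concrete, a small sketch comparing the number of code points with the number of bytes under the two encodings mentioned; the sample strings are illustrative.

```python
s = "temp = 10\u00B0"            # 'temp = 10°'

print(len(s))                    # 10 code points
print(len(s.encode("latin-1")))  # 10 bytes: one byte per character
print(len(s.encode("utf-8")))    # 11 bytes: the degree sign takes two

# Even the code-point count is not always the number of visible characters.
accented = "e\u0301"             # "e" plus a combining accent, displayed as one glyph
print(len(accented))             # 2
```

Software that assumes one byte per character would compute the wrong width for the UTF-8 form, which is the case the proposal tries to keep out of CHAR arrays.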
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
@ChrisBarker-NOAA My original observation was that we can absolutely split off some of these issues. I see two issues being peeled off from the base issue.

* Define a convention for attributes with multiple strings (string array attributes).
* Determine what to do (or not do) regarding different encodings in string attributes.

I think you've made a strong case for starting out by specifying ASCII and Unicode / UTF-8 as the only valid contents for string attributes, with one of the two spinoff issues addressing the question of broadening the options.

-- Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599211543
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
One small additional note about Python and Unicode:

The post Jim pointed us to, https://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/, is now six years old -- and many of the issues brought up have been addressed. And the author of that post has another post on the dangers of referring back to such older opinions: https://lucumr.pocoo.org/2016/11/5/be-careful-about-what-you-dislike/

Another issue with that discussion is that it's written from the perspective of what some folks in the community are calling "byte slingers": those that write libraries and the like that deal with binary data and protocols. And the fact is that Python3's string model is NOT as well suited to those use cases. But it is massively better suited to most more "casual" use cases. In that post, he refers to "beginners", but it's not beginners, it's anyone that does not understand the subtleties of binary data, encodings, and the like. Which is most of us "scientific programmers".

Bringing this back to CF: ideally we would choose an approach that is well suited to the "normal scientific programmer", and leave the encoding/decoding to the libraries. And have confidence that the "byte slingers" will correctly write the libraries to match the standard, and make things "just work" for most users.

-- Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599107553
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
@JimBiardCics wrote:

Actually, I know a LOT more about Python than I do about netcdf, HDF, or CF. And I'm afraid you have it a bit confused. This is kind of off-topic, but for clarity's sake:

> Python 3 is not the same as python 2.

Very true, and a source of much confusion.

> In Python 2 there were two types -- str (ASCII) and unicode (by default UTF-8).

Almost right: there were two types:

`str`: which was a single byte per character of unknown encoding -- essentially a wrapped char* -- usually ascii compatible, often latin-1, but not if you were Japanese, for instance. It was also used as a holder of arbitrary binary data: see numpy's "fromstring()" methods, or reading a binary file. Much like how char* is used in C.

`unicode`: which was unicode text -- stored internally in UCS-2 or UCS-4 depending on how Python was compiled (I know, really?!?!). It could be encoded / decoded in various encodings for IO and interaction with other systems.

> In Python 3 there is only str, and by default it holds UTF-8 unicode

Almost right: the Py3 `str` type is indeed Unicode, but it holds a sequence of Unicode code points, which are internally stored in a dynamic encoding depending on the content of the string (really! a very cool optimization, actually: if you have only ascii text, it will use only one byte per char, see https://rushter.com/blog/python-strings-and-memory/ ). But all that is hidden from the user. To the user, a `str` is a sequence of characters from the entire Unicode set, very simply. (Unicode is particularly weird in that one "code point" is not always one character, or "grapheme", to accommodate languages with more complex systems of combining characters, etc., but I digress.)

And there are still two types -- in Python3 there is the "bytes" type, which is actually very similar to the old python2 string type -- but intended to hold arbitrary binary data, rather than text. But text is binary data, so it can still hold that.
In fact, if you encode a string, you get a bytes object:
```
In [13]: s
Out[13]: 'some text'

In [14]: b = s.encode("ascii")

In [15]: b
Out[15]: b'some text'
```
Note the little 'b' before the quote. In that case, they look almost identical, as I encoded in ASCII. But what if I had some non-ASCII text?
```
In [18]: s = "temp = 10\u00B0"

In [19]: s
Out[19]: 'temp = 10°'

In [20]: b = s.encode("ascii")
---
UnicodeEncodeError                        Traceback (most recent call last)
----> 1 b = s.encode("ascii")

UnicodeEncodeError: 'ascii' codec can't encode character '\xb0' in position 9: ordinal not in range(128)
```
Oops, can't do that -- the degree symbol is not part of ASCII. But I can do utf-8:
```
In [21]: b = s.encode("utf-8")

In [22]: b
Out[22]: b'temp = 10\xc2\xb0'
```
which now displays the byte values, escaping the non-ascii ones. So that bytes object is what would get written to a netcdf file, or any other binary file.

And Python can just as easily encode that text in any supported encoding, of which there are many:
```
In [28]: s.encode("utf-16")
Out[28]: b'\xff\xfet\x00e\x00m\x00p\x00 \x00=\x00 \x001\x000\x00\xb0\x00'
```
But please don't use that one!

So anyway, the relevant point here is that there is NOTHING special about utf-8 as far as Python is concerned. And in fact, Python is well suited to handle pretty much any encoding folks choose to use -- but it doesn't help a bit with the fundamental problem that you need to know what the encoding of your data is in order to use it. And if Python software (like any other) is going to write a netcdf file with non-ascii text in it, it needs to know what encoding to use.

The other complication that has come up here is that, IIUC, the netCDF4 Python library (a wrapper around the C libnetcdf) makes, I think, no distinction between the netcdf types CHAR and STRING (don't quote me on that), but that's a decision of the library authors, not a limitation of Python.
Actually, it does seem to give the user some control: https://unidata.github.io/netcdf4-python/netCDF4/index.html#netCDF4.chartostring

Note that utf-8 is the default, but you can do whatever you want.

In any case, the Python libraries can be made to work with anything reasonable that CF decides, even if I have to write the PRs myself :-)

Sorry to be so long-winded, but this IS confusing stuff!
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
I'm getting double messages -- I think we may have a feedback loop between GitHub and the list. But anyway:

> Hmmm. Chris, I think you are implying a problem that does not exist.

I hope that's true. Sorry if I stirred up confusion. But I was responding to a comment about ASCII vs UTF-8, and I picked this up in email, so I was unsure of the context.

I've now gone and re-read the issue, and I'm a bit confused about what's still on the table. But way back, someone wrote: "two issues: the use of strings, and the encoding. These can be decided separately, can't they?" And there was another one: arrays of strings vs whitespace-separated strings.

(I'm also not completely clear about the difference between a char* and a string anyway. Either way, it's a bunch of bytes that need to be interpreted.)

So I'll just talk about encoding here. A few points (I know you all know most of this, and most of it has been stated in this thread, but to put it all in one place...):

* Encodings are a nightmare: any place that a pile of bytes could be in more than one encoding is a pain in the a$$ for any client software -- think about the earlier days of html!
* Being able to use non-ASCII characters is important and unavoidable. We can certainly restrict CF names to ASCII, but it's simply not an option for variables or attributes. (I don't think anyone is suggesting that anyway.) And Unicode is the obvious way to support that.

So that leaves one open question: what encoding(s) are allowed for a CF compliant file?

I'm going to be direct here:

THERE IS NO REASON TO ALLOW MORE THAN ONE ENCODING

It only leads to pain. Period. End of story.

If there is one allowed encoding, then all CF compliant software will have to be able to encode/decode that encoding. But ONLY that one! If we allow multiple encodings, then to be fully compliant, all software would have to encode/decode a wide range of encodings, and there would have to be a way to specify the encoding.
So all software would have to be more complex, and there would be a lot more room for error.

If there is only one encoding allowed, then there are really only two options:

UCS-4: because it handles all of Unicode and is always the same number of bytes per code point. A lot more like the old char* days. However, no one wants to waste all that disk space, so that leaves:

UTF-8: which is ASCII compatible, handles all of Unicode, and has been almost universally adopted in most internet exchange formats (those that are sane enough to specify a single encoding :-) ). It is also friendly to older software that uses null-terminated char* and the like, so even old code will probably not break, even if it does misinterpret the non-ascii bytes. And old software that writes plain ascii will also work fine, as ascii IS utf-8.

All that's a long way of saying: CF should specify UTF-8 as the only correct encoding for all text: char or string. With possibly some extra restrictions to ASCII in some contexts.

If that had already been decided, then sorry for the noise :-)

-- Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599001824
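The trade-off between the two candidates can be sketched in a few lines. Python's `utf-32-le` codec is used here as a stand-in for UCS-4 (four bytes per code point, no byte-order mark); the sample text is illustrative.

```python
s = "temp = 10\u00B0"            # 'temp = 10°'

ucs4 = s.encode("utf-32-le")     # fixed 4 bytes per code point
utf8 = s.encode("utf-8")         # 1 byte for ASCII, more for the rest
print(len(ucs4), len(utf8))      # 40 11

# Plain ASCII text is byte-for-byte identical when encoded as UTF-8.
print("plain ascii".encode("ascii") == "plain ascii".encode("utf-8"))  # True

# UTF-8 adds no zero bytes of its own, so null-terminated char* code
# still finds the end of the text; the UCS-4 form is full of them.
print(b"\x00" in utf8, b"\x00" in ucs4)  # False True
```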
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
Chris, Python 3 is not the same as python 2. In Python 2 there were two types: str (ASCII) and unicode (by default UTF-8). In Python 3 there is only str, and by default it holds UTF-8 unicode (there's lots of subtlety that I'm glossing over here, but this is what it boils down to). It bit me recently, so I'm sensitive to it.

https://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/
https://docs.python.org/3/howto/unicode.html

-- Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-598958361
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
Hmmm. Chris, I think you are implying a problem that does not exist. I do not think CF has ever restricted the use of UTF-8 in free text within attributes. I suspect there are many UTF-8 attribute examples in the wild, though I do not have one up my sleeve right now. Please correct me if I'm wrong.

-- Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-598956209
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
@Dave-Allured I approve of your proposal. I think we pretty much have no choice but to allow UTF-8 as a baseline to start with, but there clearly are larger issues to be resolved. (I say "no choice" because, for example, constraining to ASCII in Python 3 is a bit complicated.)

-- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-598914012
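To make the Python 3 remark concrete: a `str` accepts any Unicode code point, so an ASCII-only rule has to be enforced with an explicit check at write time. The following is only a sketch of one way a writer might do that; `require_ascii` is a hypothetical helper, not part of any netCDF or CF library.

```python
def require_ascii(value):
    """Return value unchanged if it is ASCII-only, otherwise raise ValueError.

    Hypothetical helper for illustration; not from any CF or netCDF package.
    """
    try:
        # Encoding to ASCII fails on the first code point above U+007F.
        value.encode("ascii")
    except UnicodeEncodeError as err:
        bad = value[err.start]
        raise ValueError(
            f"non-ASCII character {bad!r} at index {err.start}"
        ) from err
    return value


print(require_ascii("air_temperature"))
try:
    require_ascii("caf\u00e9")
except ValueError as exc:
    print("rejected:", exc)
```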
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
@Dave-Allured That sounds OK to me. Whatever does get adopted should probably be pretty explicit about what is "not allowed".

-- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-598912616
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
@DocOtak, thanks for restarting this. In light of past difficulties, I move to split the issue. I think it would be possible to break out the more difficult parts of this topic into new and separate issues. I suggest that this issue #141 be narrowed to only a single essential ingredient: scalar string-type attributes as an alternative for traditional character-type attributes.

Can we agree to move the following to new GitHub issues, and focus for now only on legalizing scalar string-type attributes?

* Array string-type attributes.
* All discussion of character sets, UTF-8 or otherwise.

-- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-598909876
Re: [CF-metadata] [cf-convention/cf-conventions] Add support for attributes of type string (#141)
So now that [string variables have landed](https://github.com/cf-convention/cf-conventions/pull/140), I want to bring some attention to this issue again. Some updates I've learned about:

* Because strings are allowed in variables, some are assuming that string attributes are allowed by CF 1.8, especially because [Appendix A](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#attribute-appendix) uses the word "string" instead of... text? whatever `nc_get_att_text()` returns...
* This is now being used as a "CF-1.8" conforming recommendation from [PODAAC](https://podaac.jpl.nasa.gov/) for [pinniped data](https://github.com/oceandatainterop/nc-eTAG/blob/master/nc-eTAG_Archival_Template.cdl).
* I personally see use of string arrays as having extreme utility for lists of people's names (mostly ACDD attrs), flag definitions, and anything else that is currently "white space separated" in the CF standard.

-- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-598851406
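A short Python sketch of why whitespace-separated values are awkward for things like names. The attribute values below are invented for illustration and the variable names are assumptions; the string array is represented here by a plain Python list rather than by any netCDF API.

```python
# Hypothetical attribute values, for illustration only.
flag_meanings = "good questionable bad"           # one whitespace-separated string
creator_name_text = "Ada Lovelace Grace Hopper"   # two names joined the same way
creator_name_array = ["Ada Lovelace", "Grace Hopper"]  # what a string array could hold

# Single-word items survive a whitespace split intact.
print(flag_meanings.split())

# Multi-word items do not: the boundary between the two names is lost.
print(creator_name_text.split())

# An array keeps each element whole, with no separator convention needed.
print(creator_name_array)
```

With a single text attribute the reader needs a separator convention (and an escape rule for values containing the separator); with an array of strings each element is unambiguous.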