@JimBiardCics wrote:

Actually, I know a LOT more about Python than I do about netCDF, HDF, or CF. And I'm afraid you have it a bit confused. This is kind of off-topic, but for clarity's sake:
> Python 3 is not the same as python 2.

Very true, and a source of much confusion.

> In Python 2 there were two types -- str (ASCII) and unicode (by default UTF-8).

Almost right: there were two types:

`str`: which was a single byte per character of unknown encoding -- essentially a wrapped char* -- usually ASCII compatible, often latin-1, but not if you were Japanese, for instance.... It was also used as a holder of arbitrary binary data: see numpy's "fromstring()" methods, or reading a binary file. Much like how char* is used in C.

`unicode`: which was Unicode text -- stored internally in UCS-2 or UCS-4 depending on how Python was compiled (I know, really?!?!). It could be encoded / decoded in various encodings for IO and interaction with other systems.

> In Python 3 there is only str, and by default it holds UTF-8 unicode

Almost right: the Py3 `str` type is indeed Unicode, but it holds a sequence of Unicode code points, which are internally stored in a dynamic encoding depending on the content of the string (really! a very cool optimization, actually: if you have only ASCII text, it will use only one byte per char https://rushter.com/blog/python-strings-and-memory/ ). But all that is hidden from the user. To the user, a `str` is a sequence of characters from the entire Unicode set, very simply.

(Unicode is particularly weird in that one "code point" is not always one character, or "grapheme", to accommodate languages with more complex systems of combining characters, etc, but I digress..)

And there are still two types -- in Python 3 there is the `bytes` type, which is actually very similar to the old Python 2 `str` type -- but intended to hold arbitrary binary data, rather than text. But encoded text is binary data, so it can still hold that. In fact, if you encode a string, you get a bytes object:

```
In [13]: s
Out[13]: 'some text'

In [14]: b = s.encode("ascii")

In [15]: b
Out[15]: b'some text'
```

Note the little 'b' before the quote.
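To make the `str` / `bytes` distinction concrete, here is a minimal sketch of the round trip in plain Python 3 (standard library only; the variable names are just for illustration):

```python
# Text lives in a str; encoding it produces a bytes object.
s = "some text"
b = s.encode("ascii")

print(type(s).__name__, type(b).__name__)  # str bytes

# Indexing a str gives a one-character str;
# indexing bytes gives an integer byte value.
print(repr(s[0]), b[0])  # 's' 115

# Decoding with the same encoding recovers the original text.
round_tripped = b.decode("ascii")
print(round_tripped == s)  # True
```

So the two types look alike when printed, but they behave differently as soon as you index them or try to mix them.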
In that ASCII example, the `str` and the `bytes` look almost identical. But what if I had some non-ASCII text?

```
In [18]: s = "temp = 10\u00B0"

In [19]: s
Out[19]: 'temp = 10°'

In [20]: b = s.encode("ascii")
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-20-3930abba6989> in <module>
----> 1 b = s.encode("ascii")

UnicodeEncodeError: 'ascii' codec can't encode character '\xb0' in position 9: ordinal not in range(128)
```

Oops, can't do that -- the degree symbol is not part of ASCII. But I can do utf-8:

```
In [21]: b = s.encode("utf-8")

In [22]: b
Out[22]: b'temp = 10\xc2\xb0'
```

which now displays the byte values, escaping the non-ASCII ones.

So that bytes object is what would get written to a netCDF file, or any other binary file.

And Python can just as easily encode that text in any supported encoding, of which there are many:

```
In [28]: s.encode("utf-16")
Out[28]: b'\xff\xfet\x00e\x00m\x00p\x00 \x00=\x00 \x001\x000\x00\xb0\x00'
```

But please don't use that one!

So anyway, the relevant point here is that there is NOTHING special about utf-8 as far as Python is concerned. And in fact, Python is well suited to handle pretty much any encoding folks choose to use -- but it doesn't help a bit with the fundamental problem that you need to know what encoding your data is in in order to use it. And if Python software (like any other) is going to write a netCDF file with non-ASCII text in it, it needs to know what encoding to use.

The other complication that has come up here is that, IIUC, the netCDF4 Python library (a wrapper around the C libnetcdf) makes no distinction between the netCDF types CHAR and STRING (don't quote me on that), but that's a decision of the library authors, not a limitation of Python.
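To illustrate the "you need to know the encoding" point, here is a small sketch (plain Python 3, nothing netCDF-specific; the variable names are just for illustration) that decodes the same bytes under two different assumptions:

```python
text = "temp = 10\u00B0"
raw = text.encode("utf-8")  # the bytes that would end up in a file

# Decoding with the encoding that was actually used recovers the text.
as_utf8 = raw.decode("utf-8")

# Decoding with a wrong guess does not fail; it silently yields wrong text,
# because latin-1 maps every byte to some character.
as_latin1 = raw.decode("latin-1")

print(as_utf8)    # temp = 10°
print(as_latin1)  # temp = 10Â°
```

Nothing in the bytes themselves says which reading is correct, which is why the file (or the convention) has to say so.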
Actually, the netCDF4 library does seem to give the user some control:

https://unidata.github.io/netcdf4-python/netCDF4/index.html#netCDF4.chartostring

Note that utf-8 is the default, but you can do whatever you want.

In any case, the Python libraries can be made to work with anything reasonable CF decides, even if I have to write the PRs myself :-)

Sorry to be so long-winded, but this IS confusing stuff!

--
View it on GitHub: https://github.com/cf-convention/cf-conventions/issues/141#issuecomment-599005152