Hi all, I tried to post this note last week, in response to the TRAC ticket, but it doesn't seem to have gone through. Sorry if this is a repeat.
Note that it seems Bob has lost momentum on this one, and closed the ticket. However, the fact that the OP is dropping out doesn't mean it's not still a good idea.

For my part, I think it IS a good idea, though I'm also not motivated enough to push it through. Hopefully someone is motivated enough to iron out the last details -- I think we are close.

TL;DR:

It is clear (to me, anyway) that an array of chars and an array of strings are different things, so it makes enormous sense for CF to have a way to clearly specify the distinction.

However, it is not clear whether there are enough use-cases in the wild (or in the future?) that use arrays of chars as arrays of char, rather than as strings. If there are not, then there isn't much point to this proposal (though not much of a downside, either...).

What I intended to post last week:

I'm not sure if I can comment on a TRAC ticket (I don't seem to be able to) so I'm putting this note here.

> I think _Encoding is good. I've just consulted the netCDF user guide, and
> I see they don't include _Encoding as one of their attribute conventions
> there. Yet the use of the underscore should imply it means something
> special to the netCDF library, according to their conventions

Ahh! I was wondering about that. Some searching has revealed:

> """
> Note on char data: Although the characters used in netCDF names must be
> encoded as UTF-8, character data may use other encodings. The variable
> attribute “_Encoding” is reserved for this purpose in future implementations.
> """
> in:
> http://www.unidata.ucar.edu/software/netcdf/netcdf-4/newdocs/netcdf/Classic-Format-Spec.html

So I think yes, _Encoding is special to netcdf, and thus the correct spelling.

> I see you've combined your two tickets, and the choice between charset or
> _Encoding indicates whether it's char or string data. I'm not convinced
> still that we need this distinction. On the email list we are discussing
> Example H4.
IIUC, that was simply an example (and I have not been following the discussion) of an ambiguous case -- not the only driver behind this idea.

> Are there any other cases
> where CF is ambiguous about whether a variable is a char array or a
> string?

It seems patently obvious to me that if CHAR is a data type, then an ND array of char type is a perfectly reasonable entity to use. And if there is no STRING data type, then a file reader has no idea whether a 2D array of char is actually a 2D array of scalar char types or a 1D array of strings, and yes, they would be read and used differently.

This gets particularly tricky if you want to convert from netcdf3 (no string type) to netcdf4 (string type) -- do you convert the char array to a string type?

This is not a specious example -- if I read a netcdf file with an intelligent reader, I will likely convert an ND char array to a (N-1)D array of strings in the "native" format (say, a numpy array of strings). Then if I write that array out to a netcdf4 file, it would get written as a String array. If the char array was intended to be an array of Strings, this is great. If it was intended to be an array of individual chars, then I will have just inadvertently changed the semantics of the data.

There is also the issue of specifying an encoding -- if you want to specify an encoding for those chars without turning them into strings, what do you do?

All this being said -- the key question remains:

Are there any files out in the wild that DO use ND arrays of NC_CHAR that are not intended to be interpreted as a (N-1)D array of Strings?

Prior art:

I just whipped up a little Python script to play with this (see enclosed).

If I create a netcdf3 file with a (6,4) array of char, then ncdump it, it looks like this:

$ ncdump char_test.nc
netcdf char_test {
dimensions:
    first = 6 ;
    second = 4 ;
variables:
    char a_char_array(first, second) ;
data:

 a_char_array =
  "ABCD",
  "EFGH",
  "IJKL",
  "MNOP",
  "QRST",
  "UVWX" ;
}

So ncdump (i.e. CDL) is pretty much interpreting it as an array of strings. Actually, not quite -- it is a 2D array of CHAR that you "write" as a 1D array of strings...

However, if you load it via the Python netCDF4 library, you get a 2D array of individual characters:

size of array is: (6, 4)
datatype of array is: |S1
contents of array is:
[['A' 'B' 'C' 'D']
 ['E' 'F' 'G' 'H']
 ['I' 'J' 'K' 'L']
 ['M' 'N' 'O' 'P']
 ['Q' 'R' 'S' 'T']
 ['U' 'V' 'W' 'X']]

(Note that numpy does not have a char type, so each element is represented as a String of length 1, i.e. a one-byte-per-char string.)

You can convert it into an array of strings if you want:

arr = np.fromstring(arr.tostring(), dtype='S%i' % arr.shape[1])

size of array is: (6,)
datatype of array is: |S4
contents of array is:
['ABCD' 'EFGH' 'IJKL' 'MNOP' 'QRST' 'UVWX']

But you'd need to know the intent of the data to know that you need to do that.

I haven't looked to see what a CF-aware library (like Iris or maybe netcdf-Java) does with 2D arrays of characters.

In the end, in the Python world, we say that "explicit is better than implicit" -- so while we probably need to say the default is for an ND array of CHAR to be interpreted as a (N-1)D array of strings, having a way to say "I really want this to be a char array" seems like a good idea to me -- what's the downside?

-CHB

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
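To make the "explicit is better than implicit" point concrete, here is a minimal sketch of how a reader might choose between the two interpretations. It is pure Python with no netCDF library involved, and the helper name `interpret_char_array` is hypothetical. The mapping it uses (a `charset` attribute means "keep individual chars", an `_Encoding` attribute or no attribute at all means "join into strings") is an assumption made for illustration, not settled CF text.

```python
# Hypothetical sketch: a 2D NC_CHAR variable is modelled as a list of rows,
# each row being a list of one-byte values.

def interpret_char_array(rows, attrs):
    """Return the data as chars or as strings, depending on the attributes.

    ASSUMPTION (not from any spec): 'charset' marks a true array of chars,
    '_Encoding' marks strings, and with neither we default to strings.
    """
    if "charset" in attrs:
        # keep the 2D shape and decode each one-byte cell on its own
        return [[cell.decode(attrs["charset"]) for cell in row] for row in rows]
    encoding = attrs.get("_Encoding", "ascii")
    # collapse the last dimension: each row becomes one string
    return [b"".join(row).decode(encoding) for row in rows]


rows = [[b"A", b"B", b"C", b"D"], [b"E", b"F", b"G", b"H"]]

as_strings = interpret_char_array(rows, {"_Encoding": "utf-8"})
as_chars = interpret_char_array(rows, {"charset": "ascii"})
default = interpret_char_array(rows, {})

print(as_strings)
print(as_chars)
print(default)
```

With no attribute, the reader falls back to the (N-1)D-array-of-strings interpretation, so existing files would keep their current meaning; only files that opt in would be read as individual chars.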
#!/usr/bin/env python

# code to test char data in netcdf

import numpy as np
import netCDF4

# create an example file
ds = netCDF4.Dataset("char_test.nc", 'w', format='NETCDF3_CLASSIC')
first = ds.createDimension("first", 6)
second = ds.createDimension("second", 4)

char_var = ds.createVariable("a_char_array", 'c', ('first', 'second'))

# fill the array with the letters A..X, one character per element
for i in range(len(first)):
    for j in range(len(second)):
        char_var[i, j] = chr(65 + (i * len(second)) + j)

ds.close()

# read it:
ds = netCDF4.Dataset("char_test.nc")
var = ds.variables['a_char_array']
print("the netCDF4 variable object:")
print(var)

arr = var[:]
print("Now a 2D array of characters")
print("size of array is:", arr.shape)
print("datatype of array is:", arr.dtype)
print("contents of array is:")
print(arr)

# convert to array of strings:
# reinterpret each row of one-byte chars as a single fixed-width string
arr = np.frombuffer(arr.tobytes(), dtype='S%i' % arr.shape[1])

print()
print("Now a 1D array of strings")
print("size of array is:", arr.shape)
print("datatype of array is:", arr.dtype)
print("contents of array is:")
print(arr)
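On the encoding question raised above, one wrinkle is easy to show in a few lines: with a multi-byte encoding such as UTF-8, one NC_CHAR cell holds one byte, which is not necessarily one character. The sketch below is pure Python, the helper names `to_cells` and `row_to_string` are hypothetical, and NUL padding of short rows is an assumption made for the example.

```python
# Sketch only: one NC_CHAR cell is modelled as a bytes object of length 1.

def to_cells(text, width, encoding):
    """Encode text and split it into `width` one-byte cells, NUL-padded."""
    raw = text.encode(encoding)
    if len(raw) > width:
        raise ValueError("encoded text does not fit in %d cells" % width)
    raw = raw.ljust(width, b"\x00")
    return [raw[i:i + 1] for i in range(width)]


def row_to_string(cells, encoding):
    """Join the cells of one row, strip NUL padding, and decode as a whole."""
    return b"".join(cells).rstrip(b"\x00").decode(encoding)


text = "caf\u00e9"                      # 4 characters, 5 bytes in UTF-8
cells = to_cells(text, 6, "utf-8")
whole = row_to_string(cells, "utf-8")   # decoding the joined row works

# Decoding cell by cell fails on the first byte of the two-byte sequence.
try:
    per_cell = [c.decode("utf-8") for c in cells]
except UnicodeDecodeError:
    per_cell = None

print(len(text), len(cells), whole, per_cell)
```

So an encoding attribute on a char array only has a clear meaning cell by cell for single-byte encodings; for anything else the row has to be treated as a unit, which is effectively the string interpretation.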
_______________________________________________
CF-metadata mailing list
CF-metadata@cgd.ucar.edu
http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata