** NOTE: this looks like it got tacked on to another thread -- please start a new thread for a new topic. (or gmail messed up...) **

Sorry for being dense here, but I'm confused. I see in the netCDF(4) spec:

"""
The atomic external types supported by the netCDF interface are:
...
NC_CHAR 8-bit character byte
...
NC_STRING variable length character string *
...
"""

So shouldn't one use a 2-D (or higher dim) array of NC_CHAR type if that's indeed what you have? Or is this about supporting netcdf3, which doesn't (I don't think) have a string type? It does have a BYTE type, which I would be inclined to use for a CHAR. But then I suppose you'd need to tell readers that it was intended to be a character...

Other notes:

Do folks want/need to support full Unicode characters? If so I think you'd need a 4-byte type -- call it NC_UCHAR? -- and anything else would be variable-length, which would kind of kill the whole point of a character type...

Small note: I'd prefer "encoding" to "charset" -- at least if you want to support "full" Unicode, rather than only one-byte-per-char encodings.

> The only charsets which are recommended are "ISO-8859-1" and "UTF-8".

UTF-8 is problematic because it uses a variable number of bytes per character (codepoint?). If we want to support proper Unicode, then we need to either:

use a variable-length string type (the netcdf 4 NC_STRING type?)

or

use 4 bytes per char.

Since UTF-8 is a superset of ASCII, it can be dangerous -- folks can say "this is UTF-8", and if they only happen to use the ASCII subset, all works fine, and then someone goes and tries to put a weird high-codepoint character in there, and all goes to heck.

I see that netcdf4 supports UTF-8 for names within the file (variable names, dimension names, etc), but that works because the number of bytes is known and constant once created.

Again, I'm maybe speaking from ignorance; I haven't dug into Unicode and CF and netcdf in any depth at all.

> --- An Example: Encoding three Strings: "It", "Book", and "5 €".
> The Unicode code point for the Euro symbol is 20AC (in hexadecimal),
> which is 8364 (in decimal).
> The Euro symbol is encoded in UTF-8 as 3 bytes: E2 82 AC (in hexadecimal).
> So a file would store these strings in a char array as:
>   dimensions
>     words = 3;
>     strLen = 5;
>   char myWords[words][strLen] = "It[0][0][0]", "Book[0]", "5 [E2][82][AC]";
>   charset = "UTF-8";

This is tough -- how do you know what strLen should be? You could get UTF-8 characters chopped off if it was too short. Though I suppose that's a problem for the file writer to figure out.

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

[email protected]

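P.S. To make the strLen concern concrete, here is a rough sketch in plain Python. It does not use the netCDF library at all; it only mimics a fixed-width, NUL-padded char array with byte strings. The helper names padded_utf8_rows and decodes_cleanly are made up for illustration, and NUL padding is assumed because that is what the quoted example shows.

```python
# Sketch only: mimics a fixed-width char array with Python bytes.
# padded_utf8_rows and decodes_cleanly are hypothetical helper names.

words = ["It", "Book", "5 €"]


def padded_utf8_rows(strings):
    """Encode each string as UTF-8 and NUL-pad to the longest byte length."""
    encoded = [s.encode("utf-8") for s in strings]
    # strLen must count bytes, not characters: "5 €" is 3 characters
    # but 5 bytes in UTF-8.
    str_len = max(len(b) for b in encoded)
    rows = [b.ljust(str_len, b"\x00") for b in encoded]
    return str_len, rows


def decodes_cleanly(raw):
    """Return True if the NUL-stripped bytes are valid UTF-8."""
    try:
        raw.rstrip(b"\x00").decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


str_len, rows = padded_utf8_rows(words)
print(str_len, rows)

# A writer that sized strLen by character count (4, for "Book") would
# cut the Euro sign in the middle of its 3-byte sequence.
chopped = rows[2][:4]
print(decodes_cleanly(rows[2]), decodes_cleanly(chopped))

# The fixed-width alternative: 4 bytes per character (UTF-32, no BOM).
print(len("5 €".encode("utf-32-le")))
```

So the writer has to encode first and size strLen from the byte lengths; sizing it from character counts only works while everything stays in the ASCII subset.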
_______________________________________________ CF-metadata mailing list [email protected] http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata
