I don't like "ASCII" because it only defines 7-bit values (0-127), even though chars have 8 bits. So specifying "ASCII" still leaves ambiguity if any of the chars have the 8th bit set. The file writer may know the variable will only hold 7-bit values, but it is safer for the reader to read the variable with a decoder that handles 8-bit values. "ASCII" is trouble, so there is no reason to encourage it, especially when there are compatible alternatives like ISO-8859-1.
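A minimal Python sketch of that ambiguity. The byte values and the helper name decode_chars are made up for illustration; they are not taken from any real file or library. It only shows how a strict ASCII decoder and an ISO-8859-1 decoder treat the same 8-bit chars:

```python
# Hypothetical contents of a netcdf-3 char variable: "café" stored as
# ISO-8859-1, so the last char (0xE9) has the 8th bit set.
raw = bytes([0x63, 0x61, 0x66, 0xE9])


def decode_chars(data: bytes, charset: str) -> str:
    """Decode raw 8-bit chars with the declared charset (illustrative helper)."""
    return data.decode(charset)


# A strict ASCII decoder rejects any char above 127.
try:
    decode_chars(raw, "ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# ISO-8859-1 assigns a character to every 8-bit value, so it always decodes.
latin1_text = decode_chars(raw, "iso-8859-1")

# Data that really is 7-bit decodes identically under both charsets.
seven_bit = b"temperature"
same = decode_chars(seven_bit, "ascii") == decode_chars(seven_bit, "iso-8859-1")
```

So a reader that assumes ISO-8859-1 loses nothing when the data is pure 7-bit, and does not fail when it is not.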
I do like ISO-8859-1, because
* It is compatible with ASCII for chars 0-127, which is all that ASCII specifies.
* Any variable that has just 7-bit ASCII chars can be labelled "charset=ISO-8859-1".
* It is the most commonly used single-page 8-bit charset for supporting the European languages.
* It is widely used and supported.

I do like UTF-8 because it is the only charset that supports full Unicode (all UTF-16/UCS-4/UTF-32 characters) in an 8-bit encoding (since that is all we have for characters in netcdf-3 files: 8-bit chars). And it is incredibly widely used and supported in software.

UTF-16/UTF-32/UCS-4 are not possible options because netcdf-3 files only have an 8-bit char data type, not 16- or 32-bit chars. If we want to support more than 256 different characters in a given char variable, UTF-8 is really the only option (which is fine because it is a good option).

So my proposal is: charset can specify any single-page (8-bit) character set, but the two recommended charsets would be "ISO-8859-1" (for most simple cases) and "UTF-8" (for harder cases / full Unicode).

On Wed, Feb 22, 2017 at 11:06 AM, Chris Barker <[email protected]> wrote:

> On Wed, Feb 22, 2017 at 10:38 AM, Bob Simons - NOAA Federal <
> [email protected]> wrote:
>
>> As for needing a different subject for the email: I'm lumping together 2
>> new related attribute names: "charset=..." and "data_type=string|char" so
>> that the information stored in char variables in netcdf-3 files can be
>> easily and unambiguously interpreted.
>
> somehow it got smashed in with the thread about geometries.. maybe that
> was my email client. But anyway, away we go!
>
>> You are correct. My proposal is for netcdf-3 files since they only
>> support chars, not true strings.
>
> so maybe make it clear that for netcdf4, one should use strings? I'm not
> sure if there is anything in CF now that is 3 vs 4 specific...
>
>> As for "encoding" vs "charset", I'm open to different names.
>> I chose "charset" because that is the name used in HTML and is widely
>> used in other places. Yes, XML uses "encoding". To me, the word "charset"
>> seems preferable because it is more specific than "encoding" (which also
>> has a more general purpose meaning).
>
> not a biggie -- +0 for encoding from me.
>
>> As for full Unicode support via UTF-8 vs UTF-16:
>
> well, UTF-16 is the worst option -- let's never use that! UCS-4 is the way
> to go if you want full Unicode support and constant bytes per character,
> though it "wastes" space.
>
>> Since netcdf-3 only supports 8bit chars, the 16bit UTF-16 is not an
>> option.
>
> well, sure, but at the binary level a CHAR is simply an unsigned 8-bit
> integer -- so you could stuff any encoding into an array of CHAR.
>
>> But UTF-8 is the only way I know of to support full Unicode using only
>> 8bit chars for the underlying storage.
>
> see above, but:
>
>> It is very widely used. Every modern piece of software that can read or
>> write text files supports it. It is the default for both XML and HTML 5.
>
> yeah, it really is the best compromise -- and becoming the universal form
> for data interchange.
>
>> If the file writer doesn't need full Unicode, they can use "ISO-8859-1"
>> (which is compatible with 7bit ASCII)
>
> I'd vote for ASCII and ISO-8859-1 as the only options (Or the HIGHLY
> RECOMMENDED options, at least).
>
> -CHB
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R (206) 526-6959 voice
> 7600 Sand Point Way NE (206) 526-6329 fax
> Seattle, WA 98115 (206) 526-6317 main reception
>
> [email protected]

--
Sincerely,

Bob Simons
IT Specialist
Environmental Research Division
NOAA Southwest Fisheries Science Center
99 Pacific St., Suite 255A (New!)
Monterey, CA 93940 (New!)
Phone: (831)333-9878 (New!)
Fax: (831)648-8440
Email: [email protected]

The contents of this message are mine personally and do not necessarily reflect any position of the Government or the National Oceanic and Atmospheric Administration.
<>< <>< <>< <>< <>< <>< <>< <>< <><
_______________________________________________ CF-metadata mailing list [email protected] http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata
