On Fri, Jul 18, 2014 at 9:32 AM, Andrew Collette <andrew.colle...@gmail.com> wrote:

> >> A Latin-1 based 'a' type
> >> would have similar problems.
> >
> > Maybe not -- latin1 is fixed width.
>
> Yes, Latin-1 is fixed width, but the issue is that when writing to a
> fixed-width UTF8 string in HDF5, it will expand, possibly losing data.

You shouldn't do that -- I was in no way suggesting that a latin-1 string
get pushed to a utf-8 array by default -- that would be a bad idea. utf-8
is a unicode encoding; it should be used for unicode.

As for truncation -- that's inherent in using a fixed-width array to store
a non-fixed-width encoding.

> What I would like to avoid is a situation where a user writes a
> 10-byte string from NumPy into a 10-byte space in an HDF5 dataset, and
> unexpectedly loses the last few characters because of the encoding
> mismatch.

Again, they shouldn't do that; they should be pushing a 10-character string
into something -- and utf-8 is going to (possibly) truncate that. That's an
HDF/utf-8 limitation that people are going to have to deal with. I think
you're suggesting that numpy follow the HDF model, so that the numpy-HDF
transition can be clean and easy. However, I think that utf-8 is an
inappropriate model for numpy, and that the mess of bytes to utf-8 is
pyHDF's problem, not numpy's.

i.e. your issue above -- should users put a 10-character string into a
numpy 10-byte utf-8 type and see it truncated? That's what I want to avoid.

> In any case, I certainly agree NumPy shouldn't be limited by the
> capabilities of HDF5. There are other valuable use cases, including
> access to the high-bit characters Latin-1 provides. But from a strict
> compatibility standpoint, ASCII would be beneficial.

This is where I wonder about HDF's "ascii" type -- is it really ascii? Or
is it that old standby
one-byte-per-character-and-if-it's-ascii-we-all-know-what-it-means-but-if-it's-not-we'll-still-pass-it-around
type? i.e. the old char*?

In which case, you can just push a latin-1 type into and out of your HDF
ascii arrays and everything will work just fine. Unless someone stores
something other than latin-1 or ascii in it -- but even then, the bytes
would still be preserved.

This is why I see no downside to latin-1 -- if you don't use the > 127
code points, it's the same thing -- if you do, you get some extra handy
characters. The only difference is that a proper ascii type would not let
you store anything above 127 at all -- why restrict ourselves?

And if you want utf-8 in HDF, then use a unicode array knowing that some
truncation could occur, or use a byte array, and do the encoding yourself,
so the user knows exactly what they are doing.

[it would be nice if numpy had a pure numpy solution to encoding/decoding,
though maybe it wouldn't really be any faster than going through python
anyway...]

-Chris

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
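
To make the two cases above concrete, here is a minimal sketch in plain Python. It does not use numpy or HDF5 at all: `store_fixed_width` is a hypothetical helper that only mimics a fixed-width byte field by truncating and NUL-padding, and the sample string is made up for illustration. It shows a 10-character string fitting a 10-byte field as latin-1 but losing its last character as utf-8, and arbitrary bytes surviving a latin-1 decode/encode round trip.

```python
def store_fixed_width(data: bytes, width: int) -> bytes:
    """Hypothetical stand-in for a fixed-width byte field.

    Anything longer than `width` bytes is silently cut off; anything
    shorter is padded with NUL bytes.
    """
    return data[:width].ljust(width, b"\x00")


# A 10-character string; two of the characters are above code point 127.
text = "na\u00efve caf\u00e9"

latin1 = text.encode("latin-1")  # one byte per character
utf8 = text.encode("utf-8")      # the two accented characters take 2 bytes each

stored_latin1 = store_fixed_width(latin1, 10)
stored_utf8 = store_fixed_width(utf8, 10)

# latin-1 fits exactly; utf-8 has lost the trailing character.
round_trip_latin1 = stored_latin1.decode("latin-1")
round_trip_utf8 = stored_utf8.decode("utf-8")

# Every possible byte value maps to a latin-1 code point, so bytes that
# were never latin-1 text still come back unchanged.
raw = bytes(range(256))
preserved = raw.decode("latin-1").encode("latin-1") == raw

# For pure ASCII text the two encodings produce identical bytes.
same_as_ascii = "hello".encode("ascii") == "hello".encode("latin-1")

print(len(text), len(latin1), len(utf8))
print(repr(round_trip_latin1), repr(round_trip_utf8))
print(preserved, same_as_ascii)
```

In this example the cut happens to fall on a character boundary; in general a byte-level cut can land in the middle of a multi-byte utf-8 sequence, in which case the decode step would raise an error rather than just drop a character.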
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion