I suggest a new data type 'text[encoding]', 'T'.

1. text can be cast to Python strings via decoding.
2. Conceptually, casting to Python bytes first casts to a string and then calls encode(); the encoding currently stored in the metadata is used by default, but a different encoding can be given to override it.

I slightly favour 'T16' as a fixed-size text record backed by 16 bytes. This way over-allocation is forcefully delegated to the user, simplifying the numpy array.

Yu

On Thu, Apr 20, 2017 at 12:17 PM, Robert Kern <robert.k...@gmail.com> wrote:
> On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer <sho...@gmail.com> wrote:
>>
>> On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern <robert.k...@gmail.com>
>> wrote:
>>>
>>> I don't know of a format off-hand that works with numpy uniform-length
>>> strings and Unicode as well. HDF5 (to my recollection) supports arrays of
>>> NULL-terminated, uniform-length ASCII like FITS, but only variable-length
>>> UTF8 strings.
>>
>>
>> HDF5 supports two character sets, ASCII and UTF-8. Both come in fixed and
>> variable length versions:
>> https://github.com/PyTables/PyTables/issues/499
>> https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html
>>
>> "Fixed length UTF-8" for HDF5 refers to the number of bytes used for
>> storage, not the number of characters.
>
> Ah, okay, I was interpolating from a quick perusal of the h5py docs, which
> of course are also constrained by numpy's current set of dtypes. The
> NULL-terminated ASCII works well enough with np.string's semantics.
>
> --
> Robert Kern
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
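
For illustration only, the casting rules proposed at the top of this message might be sketched in plain Python roughly as follows. Nothing here is an existing numpy API: the class `TextRecord`, its methods, and the NUL-padding convention are all hypothetical, and the sketch assumes an ASCII-compatible encoding (UTF-8, Latin-1) so that trailing NUL bytes can safely be treated as padding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    """Hypothetical stand-in for one element of a 'T<nbytes>' array."""

    raw: bytes      # fixed-size storage, NUL-padded on the right
    encoding: str   # the encoding kept in the dtype metadata

    @classmethod
    def from_str(cls, s, nbytes=16, encoding="utf-8"):
        data = s.encode(encoding)
        # The size counts bytes, not characters, so sizing the record
        # (over-allocation) is left to the user.
        if len(data) > nbytes:
            raise ValueError(
                f"{len(data)} encoded bytes do not fit in T{nbytes}"
            )
        return cls(data.ljust(nbytes, b"\x00"), encoding)

    def to_str(self):
        # Rule 1: cast to a Python string by decoding.
        # Stripping NULs is only safe for ASCII-compatible encodings.
        return self.raw.rstrip(b"\x00").decode(self.encoding)

    def to_bytes(self, encoding=None):
        # Rule 2: cast to str first, then encode(); the stored
        # encoding is the default, but the caller may override it.
        return self.to_str().encode(encoding or self.encoding)


rec = TextRecord.from_str("na\u00efve")   # 5 characters, 6 UTF-8 bytes
print(len(rec.raw), rec.to_str(), rec.to_bytes("latin-1"))
```

Because the limit is in bytes, a string of nine two-byte characters such as "\u00e9" * 9 would be rejected by `from_str` with the default 16-byte size, even though it has fewer than 16 characters.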