The idea of a one-byte string dtype has been extensively discussed twice before, with a lot of good input and ideas, but no action [1, 2].

tl;dr: Perfect is the enemy of good. Can numpy just add a one-byte string dtype named 's' that uses latin-1 encoding as a bridge to enable Python 3 usage in the near term?

A key consequence of not having a one-byte string dtype is that handling ASCII data stored in binary formats such as HDF or FITS is basically broken in Python 3. Packages like h5py, pytables, and astropy.io.fits all return text data arrays with the numpy 'S' type, and in fact have no direct support for the numpy wide unicode 'U' type. In Python 3, an 'S' type array cannot be compared with the Python str type, so something like the following fails:

    >>> mask = (names_array == "john")  # FAIL

Problems like this are now showing up in the wild [3]. Workarounds are also showing up, like a way to easily convert from 'S' to 'U' within astropy Tables [4], but this is really not a desirable way to go. Gigabyte-sized string data arrays are not uncommon, so converting to UCS-4 is a real memory and performance hit.

For a good top-level summary of much of the previous thread discussion, see [5] from Chris Barker. Condensing this down to just a few points:

- *Changing* the behavior of the existing 'S' type is going to break code and seems a bad idea.
- *Adding* a new dtype 's' will work and allow highly performant conversion from 'S' to 's' via view().
- Using the latin-1 encoding will minimize code breakage vis-a-vis what works in Python 2 [6].

Using latin-1 is a pragmatic compromise that provides continuity to allow scientists to run their existing code in Python 3 and have things just work. It isn't perfect and it should not be the end of the story, but it would be good. This single issue is the *only* thing blocking me and my team from using Python 3 in operations.

As a final point, I don't know the numpy internals at all, but it *seems* like this proposal is one of the easiest to implement amongst those that were discussed.
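
To make the points above concrete, here is a minimal sketch, assuming NumPy under Python 3. The array contents and variable names are illustrative only, and the proposed 's' dtype does not exist, so it is not used here. The sketch shows the str comparison finding no matches, the two workarounds available today (bytes literals, or conversion to 'U' at four times the memory), and why latin-1 is a safe choice for arbitrary byte values.

```python
import warnings
import numpy as np

# Illustrative data, standing in for what a binary-format reader might return.
names = np.array([b"john", b"paul", b"george"])  # dtype 'S6', one byte per char

# Comparing an 'S' array with a Python 3 str finds no matches. Depending on
# the NumPy version the result is a scalar False (with a warning) or an
# all-False array, so the warning is silenced and np.any() is used below.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    mask_str = (names == "john")
print("any match against str:", bool(np.any(mask_str)))

# Workaround 1: compare against bytes. Works, but pushes bytes literals
# into user code that was written with plain strings in Python 2.
mask_bytes = (names == b"john")
print("mask against bytes:", mask_bytes.tolist())

# Workaround 2: convert to 'U'. str comparison works, but each character
# now occupies 4 bytes (UCS-4) instead of 1.
as_unicode = names.astype("U")
mask_unicode = (as_unicode == "john")
print("itemsize S vs U:", names.itemsize, as_unicode.itemsize)  # 6 and 24

# latin-1 maps each of the 256 byte values to exactly one code point, so a
# decode/encode round trip preserves arbitrary uint8 data unchanged.
raw = bytes(range(256))
round_trip = raw.decode("latin-1").encode("latin-1")
print("latin-1 round trip lossless:", round_trip == raw)
```

For a gigabyte-sized 'S' array, the conversion in workaround 2 would need roughly four times that much memory for the 'U' copy, which is the cost the proposal aims to avoid.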

Cheers,
Tom

[1]: http://mail.scipy.org/pipermail/numpy-discussion/2014-January/068622.html
[2]: http://mail.scipy.org/pipermail/numpy-discussion/2014-July/070574.html
[3]: https://github.com/astropy/astropy/issues/3311
[4]: http://astropy.readthedocs.org/en/latest/api/astropy.table.Table.html#astropy.table.Table.convert_bytestring_to_unicode
[5]: http://mail.scipy.org/pipermail/numpy-discussion/2014-July/070631.html
[6]: It is not uncommon to store uint8 data in a bytestring

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion