Oscar, Cool stuff, thanks!
I'm wondering, though, what the use case really is. The py3 text model (actually the py2 one, too) is quite clear that you want users to think of, and work with, text as text -- and not care how things are encoded in the underlying implementation. You only want the user to think about encodings on I/O -- transferring stuff between systems, where you can't avoid it. And you might choose different encodings based on different needs.

So why have a different, the-user-needs-to-think-about-encodings numpy dtype? We already have 'U' for full-on unicode support for text. There is a good argument for a more compact internal representation for text compatible with a one-byte-per-char encoding, thus the suggestion for such a dtype. But I don't see the need for quite this. Maybe I'm not being a creative enough thinker.

Also, we may want numpy to interact at a low level with other libs that might have binary encoded text (HDF, etc.) -- in which case we need a bytes dtype that can store that data, and perhaps encoding and decoding ufuncs.

If we want a more efficient and compact unicode implementation, then the py3 one is a good place to start -- it's pretty slick! Though it may be harder to do in numpy, as text in numpy probably wouldn't be immutable.

> To make a slightly more concrete proposal, I've implemented a pure
> Python ndarray subclass that I believe can consistently handle
> text/bytes in Python 3.

this scares me right there -- is it text or bytes??? We really don't want something that is both.

> The idea is that the array has an encoding. It stores strings as
> bytes. The bytes are encoded/decoded on insertion/access. Methods
> accessing the binary content of the array will see the encoded bytes.
> Methods accessing the elements of the array will see unicode strings.
>
> I believe it would not be as hard to implement as the proposals for
> variable length string arrays.

except that with some encodings, the number of bytes required is a function of what the content of the text is -- so it either has to be variable length, or a fixed number of bytes, which is not a fixed number of characters. A fixed number of bytes both requires careful truncation (a pain) and gives surprising results for users ("why can't I fit 10 characters in a length-10 text object? And I can if they are different characters?").

> The one caveat is that it will strip
> null characters from the end of any string.

which is fatal, but you do want a new dtype after all, which presumably wouldn't do that.

-Chris

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
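
A minimal sketch of the fixed-bytes point above. It assumes UTF-8 as the encoding and uses numpy's existing 'S' (bytes) dtype as a stand-in for a fixed-width encoded field; it is not the ndarray subclass from the quoted proposal. The variable names are only illustrative.

```python
import numpy as np

# Two 10-character strings: one pure ASCII, one Greek (alpha..kappa).
ascii_text = "abcdefghij"
greek_text = "".join(chr(0x3B1 + i) for i in range(10))

# UTF-8 uses 1 byte per ASCII character but 2 bytes per Greek letter.
ascii_bytes = ascii_text.encode("utf-8")
greek_bytes = greek_text.encode("utf-8")

# A fixed 10-byte field holds all 10 ASCII characters ...
arr = np.array([ascii_bytes, greek_bytes], dtype="S10")
fits = arr[0].decode("utf-8")
# ... but only the first 5 Greek characters survive the silent truncation.
truncated = arr[1].decode("utf-8")

# With an odd byte width the cut lands in the middle of a character,
# so the stored bytes are no longer valid UTF-8.
odd = np.array([greek_bytes], dtype="S9")
try:
    odd[0].decode("utf-8")
    clean_cut = True
except UnicodeDecodeError:
    clean_cut = False

# The 'S' dtype also drops trailing null bytes on element access.
stripped = np.array([b"ab\x00"], dtype="S3")[0]

# For comparison, 'U10' always has room for 10 characters,
# at a cost of 4 bytes per character.
u_itemsize = np.array([greek_text], dtype="U10").itemsize
```

So the same "length-10" field holds ten characters of one string and five of another, and a careless width can leave undecodable bytes behind, which is the truncation pain described above.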
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion