Thanks for poking into all this. I've lost track a bit, but I think:

The 'S' type is clearly broken on py3 (at least). I think that gives us room to change it, and backward compatibility is less of an issue because it's broken already -- do we need to preserve bug-for-bug compatibility? Maybe, but I suspect in this case, not -- the code that "works fine" on py3 with the 'S' type is probably only lucky that it hasn't encountered the issues yet.
And no matter how you slice it, code being ported to py3 needs to deal with text handling issues.

But here is where we stand:

The 'S' dtype:

 - was designed for one-byte-per-char text data.
 - was mapped to the py2 string type.
 - used the classic C null-terminated approach.
 - can be used for arbitrary bytes (as the py2 string type can), but not quite, as it truncates null bytes -- so it really is a bad idea to use it that way.

Under py3:

The 'S' type maps to the py3 bytes type, because that's the closest to the py2 string type. But it also does some inconsistent things with encoding, and does treat a lot of other things as text. But the py3 bytes type does not have the same text handling as the py2 string type, so things like:

    s = 'a string'
    np.array((s,), dtype='S')[0] == s

give you False, rather than the True you get on py2. This is because a py3 string is translated to the 'S' type (presumably with the default encoding, which is another maybe-not-a-good idea), but indexing returns a bytes object, which does not compare equal to a py3 string. You can work around this with various calls to encode() and decode(), and/or by using b'a string', but that is ugly, kludgy, and doesn't work well with the py3 text model.

The py2 => py3 transition separated bytes and strings: strings are unicode, and bytes are not to be used for text (directly). While there is some text-related functionality still in bytes, the core devs are quite clear that that is for special cases only, and not for general text processing.

I don't think numpy should fight this, but rather embrace the py3 text model. The most natural way to do that is to use the existing 'U' dtype for text. That is really the best solution for most cases (like the case above).

However, there is a use case for a more efficient way to deal with text. There are a couple of ways to go about that that have been brought up here:

1: Have a more efficient unicode dtype: variable length, multiple encoding options, etc....
   - This is a fine idea that would support better text handling in numpy, and _maybe_ better interaction with external libraries (HDF, etc...)

2: Have a one-byte-per-char text dtype:

   - This would be much easier to implement and fit into the current numpy model, and would satisfy a lot of common use cases for scientific data sets.

We could certainly do both, but I'd like to see (2) get done sooner rather than later....

A related issue is whether numpy needs a dtype analogous to py3 bytes -- I'm still not sure of the use case there, so I can't comment -- would it need to be fixed length (fitting into the numpy data model better) or variable length, or ??? Some folks are (apparently) using the current 'S' type in this way, but I think that's ripe for errors, due to the null bytes issue. Though maybe there is a null-bytes-are-special binary format that isn't text -- I have no idea.

So what do we do with 'S'? It really is pretty broken, so we have a couple of choices:

(1) Deprecate it, so that it stays around for backward compatibility, but encourage people to either use 'U' for text, or one of the new dtypes that are yet to be implemented (maybe 's' for a one-byte-per-char dtype), and to use either uint8 or the new bytes dtype that is yet to be implemented for binary data.

(2) Fix it -- in this case, I think we need to be clear what it is:

  -- A one-byte-char text type? If so, it should map to a py3 string, and have a defined encoding (ascii or latin-1, probably), or even better a settable encoding (but only for one-byte-per-char encodings -- I don't think utf-8 is a good idea here, as a utf-8 encoded string is of unknown length). (There is some room for debate here: as the 'S' type is fixed length and truncates anyway, maybe it's fine for it to truncate utf-8 -- as long as it doesn't truncate in the middle of a character.)

  -- A bytes type? In which case, we should clean out all the automatic conversion to and from text that is in it now.
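To make the one-byte-per-char option a bit more concrete, here is a rough pure-Python sketch of what such a fixed-width text field would do. This is not numpy code: to_field() and from_field() are hypothetical helpers made up for illustration, and latin-1 as the default is just one of the candidate encodings mentioned above.

```python
# Hypothetical helpers modelling a fixed-width, one-byte-per-char text field.
# Nothing here is numpy API; it only illustrates the trade-offs discussed above.

def to_field(text, width, encoding="latin-1"):
    """Encode text into exactly `width` bytes, truncating or null-padding."""
    raw = text.encode(encoding)[:width]
    return raw.ljust(width, b"\x00")


def from_field(raw, encoding="latin-1"):
    """Strip the trailing null padding and decode back to a py3 string."""
    return raw.rstrip(b"\x00").decode(encoding)


# With a one-byte-per-char encoding, N characters always fit in N bytes,
# and reading the field back gives a real py3 string.
field = to_field("café", 4)
assert len(field) == 4
assert from_field(field) == "café"

# The py3 mismatch described earlier: bytes never compare equal to a string.
assert b"a string" != "a string"

# utf-8 is variable width: the same 4 characters need 5 bytes, so a naive
# byte-level truncation cuts the last character in half.
broken = to_field("café", 4, encoding="utf-8")
try:
    from_field(broken, encoding="utf-8")
except UnicodeDecodeError:
    print("truncated in the middle of a character")

# Null padding is also why arbitrary binary data is risky in such a field:
# trailing null bytes that were part of the data are lost on the way out.
assert to_field("ab\x00", 3).rstrip(b"\x00") == b"ab"
```

A real dtype would presumably do all this in C and per element, but the same two questions come up either way: which encoding the field carries, and what happens at the truncation boundary.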
I vote for it being our one-byte text type -- it almost is already, and it would make the easiest transition for folks from py2 to py3. But backward compatibility is backward compatibility.

> numpy arrays need a decode and encode method
>
> I'm not sure that they do. Rather there needs to be a text dtype that
> knows what encoding to use in order to have a binary interface as
> exposed by .tostring() and friends and but produce unicode strings
> when indexed from Python code. Having both a text and a binary
> interface to the same data implies having an encoding.

I agree with Oscar here -- let's not conflate encoded and decoded data -- the py3 text model is a fine one, and we should work with it as much as practical.

UNLESS: if we do add a bytes dtype, then it would be a reasonable use case to use it to store encoded text (just like the py3 bytes type), in which case it would be good to have encode() and decode() methods or ufuncs -- probably ufuncs. But that should be for special-purpose, at-the-I/O-interface kind of stuff.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov