2009/11/27 Christopher Barker <chris.bar...@noaa.gov>:
>
>> The point is that I don't think we can just decide to use Unicode or
>> Bytes in all places where PyString was used earlier.
>
> Agreed.

I only half agree. It seems to me that for almost all situations where
PyString was used, the right data type is a python3 string (which is
unicode). I realize there may be a few cases where it is appropriate
to use bytes, but I think there needs to be a compelling reason for
each one.

> In a way, unicode strings are a bit like arrays: they have an encoding
> associated with them (like a dtype in numpy). You can represent a given
> bit of text in multiple different arrangements of bytes, but they are all
> supposed to mean the same thing and, if you know the encoding, you can
> convert between them. This is kind of like how one can represent 5 in
> any of many dtypes: uint8, int16, int32, float32, float64, etc. Not any
> value represented by one dtype can be converted to all other dtypes, but
> many can. Just like encodings.

This is incorrect. Unicode objects do not have default encodings or
multiple internal representations (within a single python interpreter,
at least). Unicode objects use a 2- or 4-byte internal representation,
but this is almost invisible to the user. Encodings only become
relevant when you want to convert a unicode object to a byte stream.
It is usually an error to store text in a byte stream (for it to make
sense you must provide some mechanism to specify the encoding).

> Anyway, all this brings me to think about the use of strings in numpy in
> this way: if it is meant to be a human-readable piece of text, it should
> be a unicode object. If not, then it is bytes.
>
> So: "fromstring" and the like should, of course, work with bytes (though
> maybe buffers really...)

I think if you're going to call it fromstring, it should convert from
strings (i.e. unicode strings). But really, I think it makes more
sense to rename it frombytes() and have it convert bytes objects. One
could then have

    def fromstring(s, encoding="utf-8"):
        return frombytes(s.encode(encoding))

as a shortcut. Maybe ASCII makes more sense as a default encoding.

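To make the frombytes()/fromstring() split concrete, here is a minimal
sketch that runs on python3 with only the standard library. The stdlib
array module stands in for numpy, and the function names, the typecode
parameter and the ASCII default are all assumptions for illustration,
not actual or proposed numpy signatures.

```python
from array import array


def frombytes(data, typecode="B"):
    """Hypothetical helper: build a typed array from raw bytes only."""
    if not isinstance(data, (bytes, bytearray, memoryview)):
        raise TypeError(
            "frombytes() needs a bytes-like object, got %s" % type(data).__name__
        )
    out = array(typecode)
    out.frombytes(bytes(data))  # copy the raw bytes into the array
    return out


def fromstring(s, encoding="ascii"):
    """Hypothetical shortcut: encode text explicitly, then parse the bytes."""
    if not isinstance(s, str):
        raise TypeError("fromstring() needs text (str), got %s" % type(s).__name__)
    # The encoding is always stated, never guessed from the value.
    return frombytes(s.encode(encoding))


ascii_values = list(fromstring("abc"))
utf8_values = list(fromstring("\u00e9", encoding="utf-8"))
```

With an ASCII default, text containing non-ASCII characters fails
loudly at the encode step unless the caller names another encoding,
which is the behaviour the paragraph above argues for.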
But really, think about where the user's going to get the string: most
of the time it's coming from a disk file or a network stream, so it
will be a byte string already, so they should use frombytes.

>> To summarize the use cases I've run across so far:
>>
>> 1) For 'S' dtype, I believe we use Bytes for the raw data and the
>> interface.
>
> I don't think so here. 'S' is usually used to store human-readable
> strings, I'd certainly expect to be able to do:
>
> s_array = np.array(['this', 'that'], dtype='S10')
>
> And I'd expect it to work with non-literals that were unicode strings,
> i.e. human readable text. In fact, it's pretty rare that I'd ever want
> bytes here. So I'd see 'S' mapped to 'U' here.

+1

> Francesc Alted wrote:
>> the next should still work:
>>
>> In [2]: s = np.array(['asa'], dtype="S10")
>>
>> In [3]: s[0]
>> Out[3]: 'asa' # will become b'asa' in Python 3
>
> I don't like that -- I put in a string, and get a bytes object back?

I agree.

>> In [4]: s.dtype.itemsize
>> Out[4]: 10 # still 1-byte per element
>
> But what if the strings passed in aren't representable in one byte
> per character? Do we define "S" as only supporting ANSI-only string?
> what encoding?

Itemsize will change. That's fine.

>> 3) Format strings
>>
>> a = array([], dtype=b'i4')
>>
>> I don't think it makes sense to handle format strings in Unicode
>> internally -- they should always be coerced to bytes.
>
> This should be fine -- we control what is a valid format string, and
> thus they can always be ASCII-safe.

I have to disagree. Why should we force the user to use bytes? The
format strings are just that, strings, and we should be able to supply
python strings to them. Keep in mind that "coercing" strings to bytes
requires extra information, namely the encoding. If you want to
emulate python2's value-dependent coercion - raise an exception only
if non-ASCII is present - keep in mind that python3 is specifically
removing that behaviour because of the problems it caused.

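As a rough illustration of the format-string point, the sketch below
accepts python strings directly and converts bytes only through an
explicit, named encoding. The helper name as_format_text is
hypothetical and is not part of numpy; the second half just shows that
python3 refuses to mix str and bytes implicitly.

```python
def as_format_text(fmt):
    """Hypothetical helper: return a format string as text (str)."""
    if isinstance(fmt, str):
        return fmt
    if isinstance(fmt, (bytes, bytearray)):
        # Explicit, strict decode: non-ASCII bytes raise UnicodeDecodeError.
        return bytes(fmt).decode("ascii")
    raise TypeError("format must be str or bytes, got %s" % type(fmt).__name__)


# python3 performs no implicit str/bytes coercion: mixing them is a TypeError.
try:
    "i" + b"4"
except TypeError:
    implicit_mixing_allowed = False
else:
    implicit_mixing_allowed = True
```

Because the conversion is tied to the type and a stated encoding, the
outcome no longer depends on whether a particular value happens to be
ASCII-only.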
Anne