On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith <n...@pobox.com> wrote:
> On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris
> <charlesr.har...@gmail.com> wrote:
> >
> > On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin
> > <oscar.j.benja...@gmail.com> wrote:
> >>
> >> On Jan 20, 2014 8:35 PM, "Charles R Harris" <charlesr.har...@gmail.com>
> >> wrote:
> >> >
> >> > I think we may want something like PEP 393. The S datatype may be the
> >> > wrong place to look, we might want a modification of U instead so as
> >> > to transparently get the benefit of python strings.
> >>
> >> The approach taken in PEP 393 (the FSR) makes more sense for str than it
> >> does for numpy arrays for two reasons: str is immutable and opaque.
> >>
> >> Since str is immutable the maximum code point in the string can be
> >> determined once when the string is created before anything else can get
> >> a pointer to the string buffer.
> >>
> >> Since it is opaque no one can rightly expect it to expose a particular
> >> binary format so it is free to choose without compromising any expected
> >> semantics.
> >>
> >> If someone can call buffer on an array then the FSR is a semantic
> >> change.
> >>
> >> If a numpy 'U' array used the FSR and consisted only of ASCII characters
> >> then it would have a one byte per char buffer. What then happens if you
> >> put a higher code point in? The buffer needs to be resized and the data
> >> copied over. But then what happens to any buffer objects or array views?
> >> They would be pointing at the old buffer from before the resize.
> >> Subsequent modifications to the resized array would not show up in other
> >> views and vice versa.
> >>
> >> I don't think that this can be done transparently since users of a numpy
> >> array need to know about the binary representation. That's why I suggest
> >> a dtype that has an encoding. Only in that way can it consistently have
> >> both a binary and a text interface.
> >
> > I didn't say we should change the S type, but that we should have
> > something, say 's', that appeared to python as a string. I think if we
> > want transparent string interoperability with python together with a
> > compressed representation, and I think we need both, we are going to
> > have to deal with the difficulties of utf-8. That means raising errors
> > if the string doesn't fit in the allotted size, etc. Mind, this is a
> > workaround for the mass of ascii data that is already out there, not a
> > substitute for 'U'.
>
> If we're going to be taking that much trouble, I'd suggest going ahead
> and adding a variable-length string type (where the array itself
> contains a pointer to a lookaside buffer, maybe with an optimization
> for stashing short strings directly). The fixed-length requirement is
> pretty onerous for lots of applications (e.g., pandas always uses
> dtype="O" for strings -- and that might be a good workaround for some
> people in this thread for now). The use of a lookaside buffer would
> also make it practical to resize the buffer when the maximum code
> point changed, for that matter...
>
> Though, IMO any new dtype here would need a cleanup of the dtype code
> first so that it doesn't require yet more massive special cases all
> over umath.so.

Worth thinking about. As another alternative, what is the minimum we need
to make a restricted encoding, say latin-1, appear transparently as a
unicode string to python? I know the python folks don't like this much, but
I suspect something along that line will eventually be required for the
http folks.

Chuck
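
For concreteness, here is a rough sketch of what the latin-1 idea looks like when done by hand with what numpy already offers. It is not a proposal for how a new dtype would work. The sample data and variable names are made up for illustration; the only library calls are `np.char.decode`, `np.char.encode` and `astype`. It also shows the dtype="O" workaround mentioned in the quoted text.

```python
import numpy as np

# Fixed-width byte strings ('S' dtype): one byte per character.
# The sample values are made up; \xe9 is e-acute in latin-1.
raw = np.array([b"caf\xe9", b"abc"], dtype="S4")

# Decode each element as latin-1 to get a fixed-width unicode ('U') array.
# Every byte value 0-255 maps to the code point with the same number,
# so this decode cannot fail on any input.
text = np.char.decode(raw, "latin-1")

# Encoding back to latin-1 recovers the original bytes exactly.
roundtrip = np.char.encode(text, "latin-1")

# Storage cost for four characters: 4 bytes as 'S4', 16 bytes as 'U4',
# since 'U' stores four bytes per character.
sizes = (np.dtype("S4").itemsize, np.dtype("U4").itemsize)

# The object-dtype workaround: each element is an ordinary,
# variable-length Python str, at the cost of one pointer per element.
as_objects = text.astype(object)
```

The explicit decode makes a copy, so this is not the transparent behaviour asked about above; it only shows that latin-1 bytes and Python strings round-trip losslessly, which is what makes the encoding attractive for a one-byte-per-character dtype.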
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion