On Tue, Jan 21, 2014 at 5:54 AM, Aldcroft, Thomas < [email protected]> wrote:
> On Mon, Jan 20, 2014 at 6:12 PM, Charles R Harris <[email protected]> wrote:
>
>> On Mon, Jan 20, 2014 at 3:58 PM, Charles R Harris <[email protected]> wrote:
>>
>>> On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith <[email protected]> wrote:
>>>
>>>> On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris
>>>> <[email protected]> wrote:
>>>> >
>>>> > On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin <[email protected]>
>>>> > wrote:
>>>> >>
>>>> >> On Jan 20, 2014 8:35 PM, "Charles R Harris" <[email protected]>
>>>> >> wrote:
>>>> >> >
>>>> >> > I think we may want something like PEP 393. The S datatype may be
>>>> >> > the wrong place to look, we might want a modification of U instead
>>>> >> > so as to transparently get the benefit of python strings.
>>>> >>
>>>> >> The approach taken in PEP 393 (the FSR) makes more sense for str than
>>>> >> it does for numpy arrays for two reasons: str is immutable and opaque.
>>>> >>
>>>> >> Since str is immutable the maximum code point in the string can be
>>>> >> determined once when the string is created before anything else can
>>>> >> get a pointer to the string buffer.
>>>> >>
>>>> >> Since it is opaque no one can rightly expect it to expose a particular
>>>> >> binary format so it is free to choose without compromising any
>>>> >> expected semantics.
>>>> >>
>>>> >> If someone can call buffer on an array then the FSR is a semantic
>>>> >> change.
>>>> >>
>>>> >> If a numpy 'U' array used the FSR and consisted only of ASCII
>>>> >> characters then it would have a one byte per char buffer. What then
>>>> >> happens if you put a higher code point in? The buffer needs to be
>>>> >> resized and the data copied over. But then what happens to any buffer
>>>> >> objects or array views? They would be pointing at the old buffer from
>>>> >> before the resize. Subsequent modifications to the resized array would
>>>> >> not show up in other views and vice versa.
>>>> >>
>>>> >> I don't think that this can be done transparently since users of a
>>>> >> numpy array need to know about the binary representation. That's why I
>>>> >> suggest a dtype that has an encoding. Only in that way can it
>>>> >> consistently have both a binary and a text interface.
>>>> >
>>>> > I didn't say we should change the S type, but that we should have
>>>> > something, say 's', that appeared to python as a string. I think if we
>>>> > want transparent string interoperability with python together with a
>>>> > compressed representation, and I think we need both, we are going to
>>>> > have to deal with the difficulties of utf-8. That means raising errors
>>>> > if the string doesn't fit in the allotted size, etc. Mind, this is a
>>>> > workaround for the mass of ascii data that is already out there, not a
>>>> > substitute for 'U'.
>>>>
>>>> If we're going to be taking that much trouble, I'd suggest going ahead
>>>> and adding a variable-length string type (where the array itself
>>>> contains a pointer to a lookaside buffer, maybe with an optimization
>>>> for stashing short strings directly). The fixed-length requirement is
>>>> pretty onerous for lots of applications (e.g., pandas always uses
>>>> dtype="O" for strings -- and that might be a good workaround for some
>>>> people in this thread for now). The use of a lookaside buffer would
>>>> also make it practical to resize the buffer when the maximum code
>>>> point changed, for that matter...
>>>>
>>
>> The more I think about it, the more I think we may need to do that. Note
>> that dynd has ragged arrays and I think they are implemented as pointers
>> to buffers. The easy way for us to do that would be a specialization of
>> object arrays to string types only as you suggest.
>>
>
> Is this approach intended to be in *addition to* the latin-1 "s" type
> originally proposed by Chris, or *instead of* that?
>

Well, that's open for discussion. The problem is to have something that is
both compact (latin-1) and interoperates transparently with python 3
strings (utf-8). A latin-1 type would be easier to implement and would
probably be a better choice for something available in both python 2 and
python 3, but unless the python 3 developers come up with something clever
I don't see how to make it behave transparently as a string in python 3.
OTOH, it's not clear to me how to make utf-8 operate transparently with
python 2 strings, especially as the unicode representation choices in
python 2 are ucs-2 or ucs-4 and the python 3 work adding utf-16 and utf-8
is unlikely to be backported. The problem may be unsolvable in a completely
satisfactory way.

Chuck
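
For concreteness, here is a small sketch of the trade-offs discussed in this thread. It is not from any of the messages above; the array contents and variable names are made up for illustration. It shows that the existing 'U' dtype spends 4 bytes per character where 'S' spends 1, that a view exposes (and can write through to) the binary layout, which is why a representation that resizes itself would not be transparent, and that utf-8 is variable width while latin-1 is not.

```python
import numpy as np

# Hypothetical sample data, ASCII only.
words = ["spam", "eggs", "ham"]

# 'U' stores UCS-4: 4 bytes per character, fixed width per element.
u = np.array(words, dtype="U4")
# 'S' stores raw bytes: 1 byte per character, fixed width per element.
s = np.array([w.encode("latin-1") for w in words], dtype="S4")

print(u.itemsize, s.itemsize)  # bytes per element for each dtype

# A view shares the array's buffer, so users can see and depend on the
# binary representation. Here the code points are exposed as uint32.
codepoints = u.view(np.uint32).reshape(len(words), 4)
codepoints[0, 0] = ord("S")  # writes through to the string array
print(u[0])

# latin-1 is always one byte per character; utf-8 is not, so a fixed
# byte width can overflow for non-ASCII text.
text = "naïve"
latin1_len = len(text.encode("latin-1"))
utf8_len = len(text.encode("utf-8"))
print(len(text), latin1_len, utf8_len)
```

If the 'U' buffer were silently reallocated with a different width, `codepoints` above would be left pointing at stale memory, which is the aliasing problem Oscar describes.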
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
