On Tue, Jan 21, 2014 at 7:37 AM, Aldcroft, Thomas <[email protected]> wrote:
> On Tue, Jan 21, 2014 at 8:55 AM, Charles R Harris <[email protected]> wrote:
>
>> On Tue, Jan 21, 2014 at 5:54 AM, Aldcroft, Thomas <[email protected]> wrote:
>>
>>> On Mon, Jan 20, 2014 at 6:12 PM, Charles R Harris <[email protected]> wrote:
>>>
>>>> On Mon, Jan 20, 2014 at 3:58 PM, Charles R Harris <[email protected]> wrote:
>>>>
>>>>> On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith <[email protected]> wrote:
>>>>>
>>>>>> On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris <[email protected]> wrote:
>>>>>>
>>>>>>> On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin <[email protected]> wrote:
>>>>>>>
>>>>>>>> On Jan 20, 2014 8:35 PM, "Charles R Harris" <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I think we may want something like PEP 393. The S datatype may be the wrong place to look; we might want a modification of U instead so as to transparently get the benefit of python strings.
>>>>>>>>
>>>>>>>> The approach taken in PEP 393 (the FSR) makes more sense for str than it does for numpy arrays for two reasons: str is immutable and opaque.
>>>>>>>>
>>>>>>>> Since str is immutable, the maximum code point in the string can be determined once, when the string is created, before anything else can get a pointer to the string buffer.
>>>>>>>>
>>>>>>>> Since it is opaque, no one can rightly expect it to expose a particular binary format, so it is free to choose without compromising any expected semantics.
>>>>>>>>
>>>>>>>> If someone can call buffer on an array then the FSR is a semantic change.
>>>>>>>>
>>>>>>>> If a numpy 'U' array used the FSR and consisted only of ASCII characters then it would have a one-byte-per-char buffer.
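[The FSR behaviour Oscar describes is observable directly in CPython 3.3+: the interpreter picks 1, 2, or 4 bytes per code point once, at creation time, from the highest code point present. A quick stdlib-only sketch:]

```python
import sys

# CPython 3.3+ (PEP 393) stores each str with 1, 2, or 4 bytes per
# code point, chosen once at creation from the maximum code point.
ascii_s = "a" * 1000           # max code point < 128    -> 1 byte/char
bmp_s   = "\u0394" * 1000      # max code point < 2**16  -> 2 bytes/char
astral  = "\U0001F600" * 1000  # max code point >= 2**16 -> 4 bytes/char

sizes = [sys.getsizeof(s) for s in (ascii_s, bmp_s, astral)]
print(sizes)  # strictly increasing: roughly 1x, 2x, 4x the length plus header overhead
```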
>>>>>>>> What then happens if you put a higher code point in? The buffer needs to be resized and the data copied over. But then what happens to any buffer objects or array views? They would be pointing at the old buffer from before the resize. Subsequent modifications to the resized array would not show up in other views and vice versa.
>>>>>>>>
>>>>>>>> I don't think that this can be done transparently, since users of a numpy array need to know about the binary representation. That's why I suggest a dtype that has an encoding. Only in that way can it consistently have both a binary and a text interface.
>>>>>>>
>>>>>>> I didn't say we should change the S type, but that we should have something, say 's', that appeared to python as a string. I think if we want transparent string interoperability with python together with a compressed representation, and I think we need both, we are going to have to deal with the difficulties of utf-8. That means raising errors if the string doesn't fit in the allotted size, etc. Mind, this is a workaround for the mass of ascii data that is already out there, not a substitute for 'U'.
>>>>>>
>>>>>> If we're going to be taking that much trouble, I'd suggest going ahead and adding a variable-length string type (where the array itself contains a pointer to a lookaside buffer, maybe with an optimization for stashing short strings directly). The fixed-length requirement is pretty onerous for lots of applications (e.g., pandas always uses dtype="O" for strings -- and that might be a good workaround for some people in this thread for now).
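[The pandas-style workaround Nathaniel mentions is available today: an object array stores pointers to ordinary Python str objects, so each element can be a different length, whereas a 'U' array must reserve 4 bytes times the longest string for every element. A minimal sketch, assuming numpy is installed:]

```python
import numpy as np

words = ["a", "bb", "a much longer string"]

# Fixed-width unicode: every element gets 4 bytes per code point,
# padded out to the longest string in the array.
fixed = np.array(words)             # dtype is U20 here
print(fixed.dtype, fixed.itemsize)  # itemsize = 4 * 20 = 80 bytes per element

# Variable-length workaround: store references to Python str objects.
ragged = np.array(words, dtype=object)
print(ragged.dtype)                 # object
ragged[0] = "now any length at all" # no fixed-width constraint
```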
>>>>>> The use of a lookaside buffer would also make it practical to resize the buffer when the maximum code point changed, for that matter...
>>>>
>>>> The more I think about it, the more I think we may need to do that. Note that dynd has ragged arrays and I think they are implemented as pointers to buffers. The easy way for us to do that would be a specialization of object arrays to string types only, as you suggest.
>>>
>>> Is this approach intended to be in *addition to* the latin-1 "s" type originally proposed by Chris, or *instead of* that?
>>
>> Well, that's open for discussion. The problem is to have something that is both compact (latin-1) and interoperates transparently with python 3 strings (utf-8). A latin-1 type would be easier to implement and would probably be a better choice for something available in both python 2 and python 3, but unless the python 3 developers come up with something clever I don't see how to make it behave transparently as a string in python 3. OTOH, it's not clear to me how to make utf-8 operate transparently with python 2 strings, especially as the unicode representation choices in python 2 are ucs-2 or ucs-4 and the python 3 work adding utf-16 and utf-8 is unlikely to be backported. The problem may be unsolvable in a completely satisfactory way.
>
> Since it's open for discussion, I'll put in my vote for implementing the easier latin-1 version in the short term to facilitate Python 2 / 3 interoperability. This would solve my use case (giga-rows of short fixed-length strings), and presumably allow things like memory mapping of large data files (like for FITS files in astropy.io.fits).
>
> I don't have a clue how the current 'U' dtype works under the hood, but from my user perspective it seems to work just fine in terms of interacting with Python 3 strings.
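[The memory argument behind Tom's use case is easy to quantify: 'U' spends four bytes (UCS-4) on every character, while latin-1 covers code points 0-255 in exactly one byte each. A rough sketch of the 4x difference, using only the stdlib codecs and a made-up catalog value as the sample string:]

```python
s = "NGC 4261"  # hypothetical short fixed-width ASCII field, 8 characters

# latin-1 is a fixed one-byte-per-character encoding for U+0000..U+00FF,
# so the encoded length always equals the character count.
latin1 = s.encode("latin-1")
assert len(latin1) == len(s)

# numpy's 'U' dtype stores UCS-4: four bytes per character. utf-32-le
# has the same fixed-width layout as one 'U' array element.
ucs4 = s.encode("utf-32-le")
assert len(ucs4) == 4 * len(s)

print(len(latin1), len(ucs4))  # 8 vs 32 bytes for the same 8 characters
```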
> Is there a technical problem with doing basically the same thing for an 's' dtype, but using latin-1 instead of UCS-4?

I think there is a technical problem. We may be able to masquerade latin-1 as utf-8 for some subset of characters, or fool python 3 in some other way. But in any case, I think it needs some research to see what the possibilities are.

Chuck
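[The subset in question is exactly ASCII: bytes 0x00-0x7F mean the same thing in latin-1 and utf-8, while latin-1 bytes 0x80-0xFF are not valid utf-8 on their own. A quick stdlib-only check, with made-up sample strings:]

```python
# ASCII-only data: the latin-1 bytes are simultaneously valid utf-8,
# so a latin-1 buffer could masquerade as utf-8 unchanged.
ascii_bytes = "plain ascii".encode("latin-1")
assert ascii_bytes.decode("utf-8") == "plain ascii"

# Outside ASCII the two encodings diverge: latin-1 e-acute is the single
# byte 0xE9, which is an invalid lead byte for a utf-8 sequence.
latin1_bytes = "café".encode("latin-1")  # b'caf\xe9'
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError:
    print("0xE9 alone is not valid utf-8")  # this branch is taken
```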
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
