On Mon, Jan 20, 2014 at 6:12 PM, Charles R Harris <charlesr.har...@gmail.com> wrote:
>
> On Mon, Jan 20, 2014 at 3:58 PM, Charles R Harris
> <charlesr.har...@gmail.com> wrote:
>
>>
>> On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith <n...@pobox.com> wrote:
>>
>>> On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris
>>> <charlesr.har...@gmail.com> wrote:
>>> >
>>> > On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin
>>> > <oscar.j.benja...@gmail.com> wrote:
>>> >>
>>> >> On Jan 20, 2014 8:35 PM, "Charles R Harris"
>>> >> <charlesr.har...@gmail.com> wrote:
>>> >> >
>>> >> > I think we may want something like PEP 393. The S datatype may be the
>>> >> > wrong place to look, we might want a modification of U instead so as to
>>> >> > transparently get the benefit of python strings.
>>> >>
>>> >> The approach taken in PEP 393 (the FSR) makes more sense for str than it
>>> >> does for numpy arrays for two reasons: str is immutable and opaque.
>>> >>
>>> >> Since str is immutable the maximum code point in the string can be
>>> >> determined once when the string is created before anything else can get a
>>> >> pointer to the string buffer.
>>> >>
>>> >> Since it is opaque no one can rightly expect it to expose a particular
>>> >> binary format so it is free to choose without compromising any expected
>>> >> semantics.
>>> >>
>>> >> If someone can call buffer on an array then the FSR is a semantic change.
>>> >>
>>> >> If a numpy 'U' array used the FSR and consisted only of ASCII characters
>>> >> then it would have a one byte per char buffer. What then happens if you put
>>> >> a higher code point in? The buffer needs to be resized and the data copied
>>> >> over. But then what happens to any buffer objects or array views? They would
>>> >> be pointing at the old buffer from before the resize. Subsequent
>>> >> modifications to the resized array would not show up in other views and vice
>>> >> versa.
>>> >>
>>> >> I don't think that this can be done transparently since users of a numpy
>>> >> array need to know about the binary representation. That's why I suggest a
>>> >> dtype that has an encoding. Only in that way can it consistently have both a
>>> >> binary and a text interface.
>>> >
>>> > I didn't say we should change the S type, but that we should have something,
>>> > say 's', that appeared to python as a string. I think if we want transparent
>>> > string interoperability with python together with a compressed
>>> > representation, and I think we need both, we are going to have to deal with
>>> > the difficulties of utf-8. That means raising errors if the string doesn't
>>> > fit in the allotted size, etc. Mind, this is a workaround for the mass of
>>> > ascii data that is already out there, not a substitute for 'U'.
>>>
>>> If we're going to be taking that much trouble, I'd suggest going ahead
>>> and adding a variable-length string type (where the array itself
>>> contains a pointer to a lookaside buffer, maybe with an optimization
>>> for stashing short strings directly). The fixed-length requirement is
>>> pretty onerous for lots of applications (e.g., pandas always uses
>>> dtype="O" for strings -- and that might be a good workaround for some
>>> people in this thread for now). The use of a lookaside buffer would
>>> also make it practical to resize the buffer when the maximum code
>>> point changed, for that matter...
>>>
>>
> The more I think about it, the more I think we may need to do that. Note
> that dynd has ragged arrays and I think they are implemented as pointers to
> buffers. The easy way for us to do that would be a specialization of object
> arrays to string types only as you suggest.
>

Is this approach intended to be in *addition to* the latin-1 "s" type
originally proposed by Chris, or *instead of* that?
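
For concreteness, here is a minimal plain-Python sketch (no NumPy) of the size trade-off behind that question. The helper `pack_fixed` is hypothetical and only for illustration; it is not an existing or proposed NumPy API. It assumes a fixed-size, NUL-padded field per element, and it shows why a one-byte-per-character latin-1 field always fits N characters in N bytes, while a UTF-8 field of the same size has to raise when a character needs more than one byte.

```python
# Sketch only: plain Python, no NumPy. `pack_fixed` is a hypothetical helper.

def pack_fixed(text, encoding, itemsize):
    """Encode `text` into a fixed-size, NUL-padded field of `itemsize` bytes."""
    raw = text.encode(encoding)
    if len(raw) > itemsize:
        raise ValueError(
            f"{text!r} needs {len(raw)} bytes as {encoding}, "
            f"but the field holds only {itemsize}"
        )
    return raw.ljust(itemsize, b"\x00")


cafe = "caf\u00e9"  # four characters, the last one outside ASCII

# Bytes needed for the same 4-character strings in each representation.
for s in ("abcd", cafe):
    sizes = {enc: len(s.encode(enc)) for enc in ("latin-1", "utf-8", "utf-32-le")}
    print(repr(s), sizes)

ascii_field = pack_fixed("abcd", "utf-8", 4)    # pure ASCII fits in 4 bytes
latin1_field = pack_fixed(cafe, "latin-1", 4)   # latin-1: one byte per character

try:
    pack_fixed(cafe, "utf-8", 4)                # U+00E9 takes two bytes in UTF-8
    utf8_overflow = False
except ValueError:
    utf8_overflow = True
```

The `utf-32-le` column corresponds to the four bytes per code point that the existing 'U' type uses; an object-array or lookaside-buffer approach would avoid the fixed field size altogether.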

- Tom

>
> <snip>
>
> Chuck
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>