On Mon, Jan 20, 2014 at 04:12:20PM -0700, Charles R Harris wrote:
> On Mon, Jan 20, 2014 at 3:58 PM, Charles R Harris <charlesr.har...@gmail.com> wrote:
> > On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith <n...@pobox.com> wrote:
> >> On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris <charlesr.har...@gmail.com> wrote:
> >> >
> >> > I didn't say we should change the S type, but that we should have
> >> > something, say 's', that appeared to python as a string. I think if we
> >> > want transparent string interoperability with python together with a
> >> > compressed representation, and I think we need both, we are going to
> >> > have to deal with the difficulties of utf-8. That means raising errors
> >> > if the string doesn't fit in the allotted size, etc. Mind, this is a
> >> > workaround for the mass of ascii data that is already out there, not a
> >> > substitute for 'U'.
> >>
> >> If we're going to be taking that much trouble, I'd suggest going ahead
> >> and adding a variable-length string type (where the array itself
> >> contains a pointer to a lookaside buffer, maybe with an optimization
> >> for stashing short strings directly). The fixed-length requirement is
> >> pretty onerous for lots of applications (e.g., pandas always uses
> >> dtype="O" for strings -- and that might be a good workaround for some
> >> people in this thread for now). The use of a lookaside buffer would
> >> also make it practical to resize the buffer when the maximum code
> >> point changed, for that matter...
> >>
> The more I think about it, the more I think we may need to do that. Note
> that dynd has ragged arrays and I think they are implemented as pointers to
> buffers. The easy way for us to do that would be a specialization of object
> arrays to string types only as you suggest.
This wouldn't necessarily help for the gigarows-of-short-text-strings use case
(depending on what "short" means). Also, even if it technically saves memory,
you may pay a greater cost from fragmenting your array all over the heap.

On my 64-bit Linux system, the size of a Python 3.3 str containing only ASCII
characters is 49+N bytes. For the 'U' dtype it's 4N bytes. So object arrays of
str only save memory over dtype='U' if the strings are 17 characters or more,
and you need strings of at least 49 characters to get a 50% saving.

If the Numpy array managed the buffers itself, that per-string memory overhead
would be eliminated in exchange for an 8-byte pointer and at least 1 byte to
represent the length of the string (assuming you can somehow use Pascal-style
strings when short enough - null bytes cannot be used as terminators). That
gives an overhead of 9 bytes per string (or 5 on 32-bit). In this case you save
memory once strings are more than 3 characters long, and you get at least a 50%
saving for strings of 9 characters or more.

Using utf-8 in the buffers eliminates the need to go around checking maximum
code points etc., so I would guess that would be simpler to implement (CPython
has now had to triple all of its code paths that actually access the string
buffer).

Oscar
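P.S. For anyone who wants to check those figures on their own machine, here is
a minimal sketch. It assumes a 64-bit CPython 3.3+ build; the exact per-str
overhead varies by platform and Python version, so treat the 49-byte figure as
approximate.

    import sys
    import numpy as np

    # Compare the per-string cost of a Python ASCII-only str (~49 + N bytes
    # on 64-bit CPython 3.3+) against the fixed-width 'U' dtype (4 bytes per
    # character), at the crossover lengths mentioned above.
    for n in (3, 9, 17, 49):
        s = "x" * n
        u_item = np.dtype("U%d" % n).itemsize
        print("N=%2d  str=%3d bytes  dtype='U%d'=%3d bytes"
              % (n, sys.getsizeof(s), n, u_item))

On my machine that shows the str object overtaking dtype='U' somewhere around
17 characters, as claimed.

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion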