On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin
<oscar.j.benja...@gmail.com> wrote:
> On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote:
> > On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin
> > <oscar.j.benja...@gmail.com> wrote:
> > > How significant are the performance issues? Does anyone really use
> > > numpy for this kind of text handling? If you really are operating on
> > > gigantic text arrays of ascii characters then is it so bad to just use
> > > the bytes dtype and handle decoding/encoding at the boundaries? If
> > > you're not operating on gigantic text arrays is there really a
> > > noticeable problem just using the 'U' dtype?
> >
> > I use numpy for giga-row arrays of short text strings, so memory and
> > performance issues are real.
> >
> > As discussed in the previous parent thread, using the bytes dtype is
> > really a problem because users of a text array want to do things like
> > filtering (`match_rows = text_array == 'match'`), printing, or other
> > manipulations in a natural way without having to continually use
> > bytestring literals or `.decode('ascii')` everywhere. I tried
> > converting a few packages while leaving the arrays as bytestrings and
> > it just ended up as a very big mess.
> >
> > From my perspective the goal here is to provide a pragmatic way to
> > allow numpy-based applications and end users to use python 3.
> > Something like this proposal seems to be the right direction, maybe
> > not pure and perfect but a sensible step to get us there given the
> > reality of scientific computing.
>
> I don't really see how writing b'match' instead of 'match' is that big a
> deal.

It's a big deal because all your existing python 2 code suddenly breaks on
python 3, even after running 2to3.  Yes, you can backfix all the python 2
code and use bytestring literals everywhere, but that is very painful and
ugly.  More importantly it's very fiddly because *sometimes* you'll need
to use bytestring literals, and *sometimes* not, depending on the exact
dataset you've been handed.
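To make the filtering point concrete, here is a minimal sketch (the array
contents are made up for illustration) of how an 'S'-dtype array forces
bytes literals on Python 3, while a 'U'-dtype view of the same data
compares naturally with ordinary string literals:

```python
import numpy as np

# A bytes-dtype ('S') array, e.g. as delivered from a binary file on disk.
names = np.array([b'match', b'other', b'match'])   # dtype is 'S5'

# With 'S' dtype you must remember the b'' prefix on Python 3:
mask = names == b'match'
print(mask.tolist())                   # [True, False, True]

# The same data cast to the 'U' dtype compares against plain str:
unames = names.astype('U5')
print((unames == 'match').tolist())    # [True, False, True]
```

The cast to 'U5' is exactly the "very big mess" being described: it copies
the whole array, which is what makes it unattractive for giga-row data.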
That's basically a non-starter.  As you say below, the only solution is a
proper separation of bytes/unicode where everything internally is unicode.
The problem is that the existing 4-byte unicode in numpy is a big
performance / memory hit.  It's even trickier because libraries will
happily deliver a numpy structured array with an 'S'-dtype field (from a
binary dataset on disk), and it's a pain to then convert to 'U' since you
need to remake the entire structured array.  With a one-byte unicode the
goal would be an in-place update of 'S' to 's'.

> And why are you needing to write .decode('ascii') everywhere?

>>> print("The first value is {}".format(bytestring_array[0]))

On Python 2 this gives "The first value is string_value", while on Python
3 this gives "The first value is b'string_value'".

> If you really do just want to work with bytes in your own known encoding
> then why not just read and write in binary mode?
>
> I apologise if I'm wrong but I suspect that much of the difficulty in
> getting the bytes/unicode separation right is down to the fact that a
> lot of the code you're using (or attempting to support) hasn't yet been
> ported to a clean text model. When I started using Python 3 it took me
> quite a few failed attempts at understanding the text model before I got
> to the point where I understood how it is supposed to be used. The
> problem was that I had been conflating text and bytes in many places,
> and that's hard to disentangle. Having fixed most of those problems I
> now understand why it is such an improvement.
>
> In any case I don't see anything wrong with a more efficient dtype for
> representing text if the user can specify the encoding. The problem is
> that numpy arrays expose their underlying memory buffer. Allowing them
> to interact directly with text strings on the one side and binary files
> on the other breaches Python 3's very good text model unless the user
> can specify the encoding that is to be used.
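Both complaints above can be demonstrated in a few lines. The itemsize
figures are what NumPy actually allocates ('S' is one byte per character,
'U' is four-byte UCS-4); the string contents are invented for the example:

```python
import numpy as np

s = np.array([b'string_value'], dtype='S12')
u = s.astype('U12')

# The 4x memory hit of the 'U' dtype versus 'S':
print(s.itemsize)   # 12 bytes per element
print(u.itemsize)   # 48 bytes per element

# The formatting annoyance on Python 3: the bytes repr leaks into output.
print("The first value is {}".format(s[0]))   # ... is b'string_value'
print("The first value is {}".format(u[0]))   # ... is string_value
```

At giga-row scale that factor of four is the difference between a table
that fits in memory and one that does not, which is why a one-byte text
dtype is being proposed at all.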
> Or at least if there is to be a blessed encoding then make it
> unicode-capable utf-8 instead of legacy ascii/latin-1.
>
> Oscar
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion