On Mon, Jan 20, 2014 at 10:12 AM, Aldcroft, Thomas
<aldcr...@head.cfa.harvard.edu> wrote:
> On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin
> <oscar.j.benja...@gmail.com> wrote:
>
>> On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote:
>>
>>> On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin
>>> <oscar.j.benja...@gmail.com> wrote:
>>>
>>>> How significant are the performance issues? Does anyone really use
>>>> numpy for this kind of text handling? If you really are operating on
>>>> gigantic text arrays of ascii characters then is it so bad to just use
>>>> the bytes dtype and handle decoding/encoding at the boundaries? If
>>>> you're not operating on gigantic text arrays is there really a
>>>> noticeable problem just using the 'U' dtype?
>>>
>>> I use numpy for giga-row arrays of short text strings, so memory and
>>> performance issues are real.
>>>
>>> As discussed in the previous parent thread, using the bytes dtype is
>>> really a problem because users of a text array want to do things like
>>> filtering (`match_rows = text_array == 'match'`), printing, or other
>>> manipulations in a natural way without having to continually use
>>> bytestring literals or `.decode('ascii')` everywhere. I tried
>>> converting a few packages while leaving the arrays as bytestrings and
>>> it just ended up as a very big mess.
>>>
>>> From my perspective the goal here is to provide a pragmatic way to
>>> allow numpy-based applications and end users to use python 3.
>>> Something like this proposal seems to be the right direction, maybe
>>> not pure and perfect but a sensible step to get us there given the
>>> reality of scientific computing.
>>
>> I don't really see how writing b'match' instead of 'match' is that big
>> a deal.
>
> It's a big deal because all your existing python 2 code suddenly breaks
> on python 3, even after running 2to3. Yes, you can backfix all the
> python 2 code and use bytestring literals everywhere, but that is very
> painful and ugly.
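For concreteness, here is a small sketch (not from the thread itself) of the two costs being discussed: numpy's 'U' dtype is fixed-width UCS-4, so it takes four bytes per character, and on Python 3 a bytes array only matches bytestring literals, which is why the b'...' prefixes creep into downstream code:

```python
import numpy as np

# Fixed-width bytes ('S') stores one byte per character; numpy's unicode
# dtype ('U') stores four bytes per character (UCS-4), a 4x memory cost
# for ASCII-only data.
print(np.dtype('S10').itemsize)  # 10
print(np.dtype('U10').itemsize)  # 40

# On Python 3, filtering a bytes array only works against bytes literals,
# so 'match' must become b'match' throughout existing code.
text_array = np.array([b'match', b'other'])
match_rows = text_array == b'match'
print(match_rows)  # [ True False]
```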
> More importantly it's very fiddly because *sometimes* you'll need to use
> bytestring literals, and *sometimes* not, depending on the exact dataset
> you've been handed. That's basically a non-starter.
>
> As you say below, the only solution is a proper separation of
> bytes/unicode where everything internally is unicode. The problem is
> that the existing 4-byte unicode in numpy is a big performance / memory
> hit. It's even trickier because libraries will happily deliver a numpy
> structured array with an 'S'-dtype field (from a binary dataset on
> disk), and it's a pain to then convert to 'U' since you need to remake
> the entire structured array. With a one-byte unicode the goal would be
> an in-place update of 'S' to 's'.
>
>> And why are you needing to write .decode('ascii') everywhere?
>
>     >>> print("The first value is {}".format(bytestring_array[0]))
>
> On Python 2 this gives "The first value is string_value", while on
> Python 3 this gives "The first value is b'string_value'".

As Nathaniel has mentioned, this is a known problem with Python 3, and
the developers are trying to come up with a solution. Python 3.4 solves
some existing problems, but this one remains. It's not just numpy here;
Python itself needs to provide some help.

Chuck
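For reference, the formatting behavior Tom describes can be reproduced without numpy at all; on Python 3, formatting a bytes object leaks its repr into the output, so every use site needs an explicit decode:

```python
# On Python 3, str.format() calls str() on a bytes object, which yields
# the repr -- so the b'...' notation leaks into formatted output.
value = b'string_value'
print("The first value is {}".format(value))
# -> The first value is b'string_value'

# The workaround is an explicit decode at every use site, which is what
# makes porting bytes-backed text arrays so tedious.
print("The first value is {}".format(value.decode('ascii')))
# -> The first value is string_value
```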
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion