On Mon, Jan 20, 2014 at 12:12 PM, Aldcroft, Thomas
<aldcr...@head.cfa.harvard.edu> wrote:
>
> On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin
> <oscar.j.benja...@gmail.com> wrote:
>>
>> On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote:
>> > On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin
>> > <oscar.j.benja...@gmail.com> wrote:
>> > > How significant are the performance issues? Does anyone really use
>> > > numpy for this kind of text handling? If you really are operating on
>> > > gigantic text arrays of ascii characters then is it so bad to just use
>> > > the bytes dtype and handle decoding/encoding at the boundaries? If
>> > > you're not operating on gigantic text arrays is there really a
>> > > noticeable problem just using the 'U' dtype?
>> > >
>> >
>> > I use numpy for giga-row arrays of short text strings, so memory and
>> > performance issues are real.
>> >
>> > As discussed in the previous parent thread, using the bytes dtype is
>> > really a problem because users of a text array want to do things like
>> > filtering (`match_rows = text_array == 'match'`), printing, or other
>> > manipulations in a natural way without having to continually use
>> > bytestring literals or `.decode('ascii')` everywhere. I tried converting
>> > a few packages while leaving the arrays as bytestrings and it just ended
>> > up as a very big mess.
>> >
>> > From my perspective the goal here is to provide a pragmatic way to allow
>> > numpy-based applications and end users to use python 3. Something like
>> > this proposal seems to be the right direction, maybe not pure and
>> > perfect but a sensible step to get us there given the reality of
>> > scientific computing.
>>
>> I don't really see how writing b'match' instead of 'match' is that big a
>> deal.
>
> It's a big deal because all your existing python 2 code suddenly breaks on
> python 3, even after running 2to3.
> Yes, you can backfix all the python 2 code and use bytestring literals
> everywhere, but that is very painful and ugly. More importantly it's very
> fiddly because *sometimes* you'll need to use bytestring literals, and
> *sometimes* not, depending on the exact dataset you've been handed. That's
> basically a non-starter.
>
> As you say below, the only solution is a proper separation of bytes/unicode
> where everything internally is unicode. The problem is that the existing
> 4-byte unicode in numpy is a big performance / memory hit. It's even
> trickier because libraries will happily deliver a numpy structured array
> with an 'S'-dtype field (from a binary dataset on disk), and it's a pain to
> then convert to 'U' since you need to remake the entire structured array.
> With a one-byte unicode the goal would be an in-place update of 'S' to 's'.
>
>> And why are you needing to write .decode('ascii') everywhere?
>
> >>> print("The first value is {}".format(bytestring_array[0]))
>
> On Python 2 this gives "The first value is string_value", while on Python 3
> this gives "The first value is b'string_value'".
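
To make the two complaints above concrete, here is a minimal sketch on
Python 3. The array contents and every name other than bytestring_array are
made up for illustration; it only shows the formatting difference and the
4x itemsize cost of converting 'S' to 'U', not a proposed fix.

```python
import numpy as np

# Hypothetical stand-in for an 'S'-dtype column read from a binary file.
bytestring_array = np.array([b"string_value", b"other"], dtype="S12")

first = bytestring_array[0]  # a numpy.bytes_ scalar

# On Python 3 the bytes repr leaks into the formatted text:
raw = "The first value is {}".format(first)
# -> The first value is b'string_value'

# An explicit decode at the point of use restores the Python 2 output:
decoded = "The first value is {}".format(first.decode("ascii"))
# -> The first value is string_value

# Converting to the 'U' dtype avoids the decode calls but makes a new
# array that stores 4 bytes per character instead of 1.
unicode_array = bytestring_array.astype("U")

print(raw)
print(decoded)
print(bytestring_array.itemsize, unicode_array.itemsize)
```

The last line prints the per-element size of each array, which is where the
memory hit for giga-row arrays comes from.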

Unfortunately (?) set_printoptions and set_string_function don't work with
numpy scalars AFAICS. If they did, it would be possible to override the
string representation.

It works for arrays. I didn't find the right key for numpy.bytes_ on python
3.3, so now my interpreter can only print bytes:

    np.set_printoptions(formatter={'all': lambda x: x.decode('ascii', errors="ignore")})

Josef

>
>> If you really do just want to work with bytes in your own known encoding
>> then why not just read and write in binary mode?
>>
>> I apologise if I'm wrong but I suspect that much of the difficulty in
>> getting the bytes/unicode separation right is down to the fact that a lot
>> of the code you're using (or attempting to support) hasn't yet been ported
>> to a clean text model. When I started using Python 3 it took me quite a
>> few failed attempts at understanding the text model before I got to the
>> point where I understood how it is supposed to be used. The problem was
>> that I had been conflating text and bytes in many places, and that's hard
>> to disentangle. Having fixed most of those problems I now understand why
>> it is such an improvement.
>>
>> In any case I don't see anything wrong with a more efficient dtype for
>> representing text if the user can specify the encoding. The problem is
>> that numpy arrays expose their underlying memory buffer. Allowing them to
>> interact directly with text strings on the one side and binary files on
>> the other breaches Python 3's very good text model unless the user can
>> specify the encoding that is to be used. Or at least if there is to be a
>> blessed encoding then make it unicode-capable utf-8 instead of legacy
>> ascii/latin-1.
>>
>> Oscar
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion