I hope you didn't mean to take this off-list:
On Fri, Jan 17, 2014 at 2:06 PM, Neil Schemenauer <n...@arctrix.com> wrote:

> In gmane.comp.python.devel, you wrote:
> > For the record, we've got a pretty good thread (not this good, though!)
> > over on the numpy list about how to untangle the mess that has resulted
>
> Not sure about your definition of good. ;-)


well, in the sense of "big" anyway...


> Could you summarize the main points on python-dev?  I'm not feeling up to
> wading through another massive thread but I'm quite interested to hear the
> challenges that numpy deals with.


Well, not much new to it, really. But here's a re-cap:

numpy has had an 'S' dtype for a while, which corresponded to the py2
string type (except for being fixed length). So it could auto-convert
to-from python strings... all was good and happy.

Enter py3: what to do? There is no py2 string type anymore. So it was
decided to have the 'S' dtype correspond to the py3 bytes
type. Apparently there was thought of renaming it, but the 'B' and 'b'
type identifiers were already taken, so 'S' was kept.

However, as we all know in this thread, the py3 bytes type is not the same
thing as a py2 string (or py2 bytes, natch), and folks like to use the 'S'
type for text data -- so that is kind of broken in py3.

However, other folks use the 'S' type for binary data, so they like (and
rely on) it being mapped to the py3 bytes type. So we are stuck with that.
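
To make the 'S'-is-bytes point concrete, here is a minimal sketch, assuming
numpy on Python 3. The array contents and variable names are made up for
illustration:

```python
import numpy as np

# 'S5' is a fixed-width field of 5 bytes; on Python 3 its items are bytes.
names = np.array([b"alpha", b"beta"], dtype="S5")

first = names[0]                     # a numpy bytes scalar
is_bytes = isinstance(first, bytes)  # it is a bytes object, not a str
is_text = isinstance(first, str)

# To treat the value as text you have to decode it explicitly.
as_text = first.decode("ascii")

print(names.dtype, is_bytes, is_text, as_text)
```

Anyone storing text in an 'S' array has to add that decode() step on py3,
while the binary-data users get exactly what they want.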

Given the nature of numpy, and scientific data, there is talk of having a
one-byte-per-char text type in numpy (there is already a unicode type, but
it uses 4-bytes-per-char, as it's key to the numpy data model that all
objects of a given type are the same size.) This would be analogous to the
current multiple precision options for numbers. It would take up less
memory, but would not be able to hold all unicode values. It's not clear
what the level of support is for this right now -- after all, you can do
everything you need to do with the appropriate calls to encode() and
decode(), if a bit awkwardly.
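
A rough sketch of the size trade-off and of the encode()/decode() round trip,
assuming numpy on Python 3. The choice of latin-1 as the one-byte encoding is
my assumption for illustration, not something decided on the numpy list:

```python
import numpy as np

# The existing unicode dtype: 4 bytes per character, fixed width.
text = np.array(["abc", "de"], dtype="U3")
u_size = text.dtype.itemsize          # 3 characters * 4 bytes each

# Emulating a one-byte-per-char text type by hand with encode()/decode().
packed = np.char.encode(text, "latin-1")
s_size = packed.dtype.itemsize
restored = np.char.decode(packed, "latin-1")

# A one-byte encoding cannot hold every value, e.g. the euro sign.
try:
    "\u20ac".encode("latin-1")
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False

print(u_size, s_size, restored.tolist(), euro_fits)
```

This works, but every read and write has to go through the conversion calls,
which is the awkward part.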

Meanwhile, back at the ranch -- related but separate issues
have arisen with the functions that parse text files: numpy.loadtxt and
numpy.genfromtxt. These functions were adapted for py3 just enough to get
things to mostly work, but they have some serious limitations when doing
anything with unicode. In fact, they do some weird things with plain ascii
text files if you ask them to create unicode objects, which is a natural
thing to do (and the "right" thing to do in the Py3 text model) if you do
something like:

arr = loadtxt('a_file_name', dtype=str)

On py3, a str is a py3 unicode string, so you get the numpy 'U' datatype,
but loadtxt wasn't designed to deal with that, so you can get stuff like:

["b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile1.txt'"
 "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile2.txt'"
 "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile3.txt'"]

This is (presumably, I haven't debugged the code) due to conversion from
bytes to unicode... (I'm still confused about the extra slashes)
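
One plausible explanation for the extra slashes, sketched in plain Python. It
assumes the bytes were turned into text with str() rather than decode(); I
have not confirmed that this is what loadtxt actually does, and the path below
is a shortened, made-up one:

```python
raw = b"C:\\Users\\Documents"     # the bytes hold single backslashes

# str() on bytes gives the repr, including the b'...' wrapper, and the
# repr escapes each backslash, so the text now holds doubled backslashes.
as_text = str(raw)

# Displaying that text (as an array repr does) escapes them once more.
shown = repr(as_text)

# Decoding is the conversion that keeps the original characters.
decoded = raw.decode("ascii")

print(as_text)
print(shown)
print(decoded)
```

Two backslashes in the file become four in the converted text and eight once
that text is displayed, which matches the pattern in the output above.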

And this is ascii text -- it gets worse if there is non-ascii text in there.

Anyway, the truth is, this stuff is hard, but it will get at least a touch
easier with PEP 461.

[though to be truthful, I'm not sure why someone put a comment in the issue
tracker about b'%d'%some_num being an issue ... I'm not sure how that applies
when we're going from text to numbers, not the other way around...]
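
For reference, the two directions look like this. This is a minimal sketch
that assumes Python 3.5 or later, where PEP 461 formatting is available; the
variable names are made up:

```python
some_num = 42

# PEP 461: number -> bytes, via %-formatting on a bytes object.
formatted = b"%d" % some_num

# The parsing direction, bytes -> number, needs no formatting at all.
parsed_int = int(b"42")
parsed_float = float(b"1.5")

print(formatted, parsed_int, parsed_float)
```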

-Chris

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev