Re: [Numpy-discussion] String & unicode arrays vs text loading in python 3

Lluís Vilanova Tue, 13 Sep 2016 11:06:17 -0700

Chris Barker writes:

> We had a big long discussion about this on this list a while back (maybe 2 yrs
> ago???) please search the archives to find it. Though I'm pretty sure that we
> never did come to a conclusion. I think it started with wanting better support
> of unicode in loadtxt and the like, and ended up delving into other encodings
> for the 'U' dtype, and maybe a single byte string dtype (latin-1), or maybe a
> variable-size unicode object like Py3's, or...


> However, it is absolutely a non-starter to change the binary representation of
> the 'S' type in any version of numpy. Due to the legacy of py2 (and, indeed,
> most computing environments) 'S' is a single byte string representation. And
> the binary representation is often really key to numpy use.
> Period, end of story.

Great, that's the type of info I wanted to get before going forward. I guess
there's code relying on the binary representation of 'S' to do mmap's or access
the array's raw contents. Is that right?
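
To make the concern concrete, here is a minimal sketch of what "relying on the
binary representation" could look like. The array contents are made up for
illustration; the point is only that each 'S' element occupies a fixed number
of bytes, NUL-padded, and that raw buffers are read back under that assumption.

```python
import numpy as np

# 'S3' stores exactly 3 bytes per element, padded with NUL bytes.
a = np.array([b"foo", b"ba"], dtype="S3")
raw = a.tobytes()          # the raw memory contents of the array

# Code that reads raw buffers (files, mmap, sockets) assumes this layout.
b = np.frombuffer(raw, dtype="S3")

# For comparison, 'U3' uses 4 bytes per character.
u = np.array(["foo"], dtype="U3")
```

Any change to how 'S' is stored would silently break the frombuffer() call
above, which is why the layout has to stay as it is.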


> And that maps to a py2 string and py3 bytes object.

> py2 does, of course, have a Unicode object as well. If you want your code (and
> doctests, and ...) to be compatible, then you should probably go to Unicode
> strings everywhere. py3 now supports the u'string' no-op literal to make this
> easier.

> (though I guess the __repr__ won't tack on that 'u', which is going to be a
> problem for docstrings).

That's exactly the problem. Doing all examples and doctests with 'U' instead of
'S' will break it for py2 instead of py3.


> Note also that py3 has added more and more "string-like" support to the bytes
> object, so it's not too bad to go bytes-only.

There is a fundamental semantic difference between a string and a byte array,
that's the core of the problem.
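
A small py3 illustration of that difference, using an example word of my own
choosing: the same text has a different length and different indexing
behaviour once it is encoded to bytes.

```python
text = "caf\u00e9"            # 4 code points
data = text.encode("utf-8")   # 5 bytes, the last character takes two

first_char = text[0]          # indexing a str gives a 1-character str
first_byte = data[0]          # indexing bytes gives an int in py3

# Going back to text always requires naming an encoding.
roundtrip = data.decode("utf-8")
```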


Here's an alternative that only handles the repr. Separate fixes would be needed
for loadtxt's and genfromtxt's problems (Sebastian Berg briefly pointed at that,
but I'd like to know more).

Whenever we repr an array using 'S', we can instead show a unicode in py3. That
keeps the binary representation, but will always show the expected result to
users, and it's only a handful of lines added to dump_data().
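
As a rough user-level sketch of the idea (this is not the dump_data() patch
itself), the same effect can be approximated with the formatter argument of
np.array2string(). The helper name text_repr and the choice of latin-1 as the
decoding are assumptions made for this example.

```python
import numpy as np

def text_repr(arr):
    # Hypothetical helper: print the elements of an 'S' array as text.
    # latin-1 is assumed because it maps every byte to one code point.
    return np.array2string(
        arr,
        formatter={"numpystr": lambda b: repr(b.decode("latin-1"))},
    )

a = np.array([b"foo", b"bar"], dtype="S3")
shown = text_repr(a)   # elements appear without the b'' prefix
```

The array itself is untouched: a.dtype is still 'S3' and a.tobytes() returns
the same bytes as before, only the printed form changes.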

If needed, I could easily add a bytes array to make the alternative explicit
(where py3 would repr the contents as b'foo').

This would only leave the less-common paths inconsistent across python versions,
which should not be a problem for most examples/doctests:

* A 'U' array will show u'foo' in py2 and 'foo' in py3.
* The new binary array will show 'foo' in py2 and b'foo' in py3 (that could also
  be patched on the repr code).
* A 'O' array will not be able to do any meaningful repr conversions.
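
For reference, this is how the three cases print today under py3, without any
of the changes discussed here. The sample values are made up; I only check
which prefixes appear in the repr, since the exact formatting may vary between
numpy versions.

```python
import numpy as np

u = np.array(["foo"], dtype="U3")             # unicode array
s = np.array([b"foo"], dtype="S3")            # byte-string array
o = np.array([b"foo", "foo"], dtype=object)   # mixed object array

u_repr = repr(u)   # shows 'foo', no u prefix in py3
s_repr = repr(s)   # shows b'foo' in py3
o_repr = repr(o)   # each element keeps its own Python repr
```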


A more complex alternative (and actually closer to what I'm proposing) is to
modify numpy in py3 to restrict 'S' to using 8-bit code points in a unicode
string. It would have the binary compatibility, while being a unicode string in
practice.
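
A minimal sketch of what that restriction would mean at the Python level,
assuming latin-1 as the mapping between 8-bit code points and bytes. The
function names to_s_array and from_s_array are hypothetical and only serve
the example; nothing like them exists in numpy.

```python
import numpy as np

def to_s_array(strings):
    # latin-1 maps code points U+0000..U+00FF to single bytes one-to-one.
    # Anything above U+00FF raises UnicodeEncodeError, which is the
    # "restriction" in this sketch.
    encoded = [s.encode("latin-1") for s in strings]
    return np.array(encoded, dtype="S")

def from_s_array(arr):
    # Decode every element back to a Python str.
    return [b.decode("latin-1") for b in arr]

arr = to_s_array(["caf\u00e9", "foo"])
back = from_s_array(arr)
```

The stored bytes are exactly what an 'S' array holds today, so existing
binary consumers would not notice, while users would only ever see strings.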


Cheers,
  Lluis
