Re: [Numpy-discussion] String & unicode arrays vs text loading in python 3

Lluís Vilanova Wed, 14 Sep 2016 07:39:17 -0700

Stephan Hoyer writes:

> On Tue, Sep 13, 2016 at 11:05 AM, Lluís Vilanova <vilan...@ac.upc.edu> wrote:
>     Whenever we repr an array using 'S', we can instead show a unicode in py3.
>     That keeps the binary representation, but will always show the expected
>     result to users, and it's only a handful of lines added to dump_data().
>
>     If needed, I could easily add a bytes array to make the alternative
>     explicit (where py3 would repr the contents as b'foo').
>
>     This would only leave the less-common paths inconsistent across python
>     versions, which should not be a problem for most examples/doctests:
>
>     * A 'U' array will show u'foo' in py2 and 'foo' in py3.
>     * The new binary array will show 'foo' in py2 and b'foo' in py3 (that
>       could also be patched on the repr code).
>     * A 'O' array will not be able to do any meaningful repr conversions.
>
>     A more complex alternative (and actually closer to what I'm proposing) is
>     to modify numpy in py3 to restrict 'S' to using 8-bit points in a unicode
>     string. It would have the binary compatibility, while being a unicode
>     string in practice.


> I'm afraid these are both also non-starters at this point. NumPy's string
> dtype corresponds to bytes on Python 3, and you can use it to store arbitrary
> binary values. Would it really be an improvement to change the repr, if the
> scalar value resulting from indexing is still bytes?


> The sanest approach is probably a new dtype for one-byte strings. We talked
> about this a few years ago, but nobody has implemented it yet:
> http://numpy-discussion.scipy.narkive.com/3nqDu3Zk/a-one-byte-string-dtype

According to the reference manual, 'S' is a "(byte-)string", which (to me)
should never contain non-printable characters. That's why I'm advocating that
"S" become your proposed one-byte string dtype, while a new "B" dtype is added
for arbitrary binary arrays. This has the added benefit of making docstrings
correct on both py2 and py3.
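
For reference, a minimal sketch of the behaviour under discussion, assuming
Python 3 and a reasonably recent NumPy. The variable names are illustrative
only; the snippet just shows what 'S' and 'U' arrays hand back today.

```python
import numpy as np

# 'S' stores raw bytes: one byte per character, fixed width.
s_arr = np.array([b"foo", b"bar"], dtype="S3")
# 'U' stores unicode code points: four bytes per character.
u_arr = np.array(["foo", "bar"], dtype="U3")

# On Python 3, indexing an 'S' array gives a bytes subclass, not str.
s_item = s_arr[0]
u_item = u_arr[0]
print(isinstance(s_item, bytes), isinstance(s_item, str))
print(isinstance(u_item, str))

# The array repr therefore shows b'...' literals for 'S' on Python 3.
print(repr(s_arr))

# 'S' also accepts arbitrary binary values, including non-printable ones.
raw = np.array([b"\x00\xff\x01"], dtype="S3")
print(bytes(raw[0]) == b"\x00\xff\x01")

# Explicit decoding is one way to get text out of an 'S' array today.
decoded = np.char.decode(s_arr, "ascii")
print(decoded.dtype.kind, decoded.tolist())
```

Changing only the repr of s_arr would leave s_item as bytes, which is the
mismatch Stephan points out above.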

But I won't keep pushing for this; I understand the backwards-compatibility
issues mentioned before. Maybe "S" should just be deprecated, "s" (as the
one-byte strings) and "B" added instead, and all docstrings and tests changed to
"s".
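
To make the one-byte string idea concrete, here is a rough pure-Python sketch.
It is not NumPy code: encode_one_byte and decode_one_byte are hypothetical
helpers, and latin-1 is assumed only because it maps code points 0-255 to
single bytes; the thread does not settle on an encoding.

```python
# Hypothetical helpers, not NumPy API: a fixed-width text field that stores
# one byte per character, using latin-1 (an assumption for this sketch).

def encode_one_byte(text: str, width: int) -> bytes:
    """Encode text into a null-padded field of exactly `width` bytes."""
    data = text.encode("latin-1")  # raises UnicodeEncodeError above U+00FF
    if len(data) > width:
        raise ValueError(f"{text!r} does not fit in {width} bytes")
    return data.ljust(width, b"\x00")


def decode_one_byte(field: bytes) -> str:
    """Strip the null padding and return the field as a str."""
    return field.rstrip(b"\x00").decode("latin-1")


field = encode_one_byte("café", 6)
round_trip = decode_one_byte(field)

# Every possible byte value survives a latin-1 decode/encode round trip,
# which is what gives binary compatibility with existing 'S' data.
all_bytes = bytes(range(256))
lossless = all_bytes.decode("latin-1").encode("latin-1") == all_bytes

# Code points above U+00FF cannot be stored in one byte and are rejected.
try:
    encode_one_byte("π", 4)
    rejected = False
except UnicodeEncodeError:
    rejected = True

print(field, round_trip, lossless, rejected)
```

The in-memory layout is the same as an 'S' field, but values come back as str
rather than bytes.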

In any case, after reading the whole thread, it's not clear to me what the
consensus on the solution is (Chris's summary is the closest thing to that).

Cheers,
  Lluis
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion
