Travis Oliphant added the comment:

On Aug 3, 2012, at 1:35 AM, Martin v. Löwis wrote:

> 
> Martin v. Löwis added the comment:
> 
>> This is a mis-understanding of what NumPy does and why.    There is  
>> a need to byte-swap only when the data is stored on disk in the  
>> reverse order from the native machine
> 
> So is there ever a need to byte-swap Unicode strings? I can see how *numeric*
> data are stored using the internal representation on disk; this is a common
> technique. For strings, there is the notion of encodings which makes the
> relationship between internal and disk representations. So if NumPy applies
> the numeric concept to string data, then this is a flaw.

Apologies for not using the correct terminology. I had to spend a lot of time
getting to know Unicode when I wrote NumPy, but I am rusty on the key points,
so I may communicate incorrectly. The NumPy representation of Unicode strings
is always UTF-32BE or UTF-32LE (depending on the data-type of the array).
The question is what to do when extracting this data into an array-scalar
(which for Unicode objects has exactly the same representation as a
PyUnicodeObject). In fact, the NumPy Unicode array scalar is a C subtype of
PyUnicodeObject and inherits from both PyUnicodeObject and the NumPy
"Character" interface, a likely rare example of dual inheritance at the
C level.

> 
> It may be that people really do store text data in the same memory blob
> as numeric data and dump it to a file, but they really should think of this
> data as "UTF-16-BE" or "UTF-32-LE" and the like, not in terms of byte
> swapping. You can use PyUnicode_Decode to create a Unicode object given a
> void*, a length, and a codec name. The concept "native Unicode
> representation" does not exist - people use all of two-byte, four-byte and
> UTF-8 representations in memory, on a single processor architecture and
> operating system.

I understand all the representations of Unicode data. There is, however, a
native byte order for the machine, and that is what I was talking about.
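
To make the two viewpoints concrete, here is a rough Python sketch using only
the standard library. It is not NumPy's implementation; `swap_utf32_units` is
a hypothetical helper written for illustration, and the sample text is an
assumption. It suggests that byte-swapping 4-byte units and relabelling the
data as the other UTF-32 codec amount to the same thing, and that the "native"
order comes from the machine (`sys.byteorder`), not from Unicode.

```python
import sys

text = "héllo"  # sample text, chosen only for illustration

# The same text as it might sit in a buffer in each UTF-32 byte order.
le_blob = text.encode("utf-32-le")
be_blob = text.encode("utf-32-be")


def swap_utf32_units(blob: bytes) -> bytes:
    """Reverse the bytes of every 4-byte code unit (hypothetical helper)."""
    if len(blob) % 4:
        raise ValueError("UTF-32 data must be a multiple of 4 bytes")
    return b"".join(blob[i:i + 4][::-1] for i in range(0, len(blob), 4))


# Naming the codec decodes either buffer without any manual swapping.
assert le_blob.decode("utf-32-le") == text
assert be_blob.decode("utf-32-be") == text

# Byte-swapping one buffer yields exactly the other encoding.
assert swap_utf32_units(be_blob) == le_blob

# The native order is a property of the machine running the code.
native_codec = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"
native_blob = text.encode(native_codec)
```

Either route should give the same string; naming the codec simply avoids
having to know the machine's byte order at the call site.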

> 
>> The byte-swapping must be done prior to conversion to a Python  
>> Unicode-Object when selecting data out of the array.
> 
> So if the byte swapping is done before the Unicode object is created:
> why did Dave and Ondřej run into problems then?

There were at least two issues: 1) a bad test, written by someone who did not
understand that you shouldn't have "byte-swapped" unicode strings as
"strings", and 2) a misunderstanding of what happens when going from the data
stored in a NumPy array to the Python "scalar" object being created.

Thank you for your explanations; they are very helpful. Also, thank you for
the PEP and the improvements in Python 3.3. The situation is *much* nicer now,
as NumPy is doing all kinds of hackery to support both narrow and wide builds.
This hackery could likely be improved even before Python 3.3, but it is
clearer how to handle the situation now in Python 3.3.

> 
> ----------
> 
> _______________________________________
> Python tracker <rep...@bugs.python.org>
> <http://bugs.python.org/issue15540>
> _______________________________________
