Hey all, 

Ondrej has been working hard with feedback from many others on improving 
Unicode support in NumPy (especially for Python 3.3).   Looking at what Python 
has done in Python 3.3 (PEP 393) and chatting on the Python issue tracker with 
the author of that PEP has made me wonder if we aren't "doing the wrong thing" 
in NumPy quite often. 

Basically, NumPy only supports UTF-32 in its Unicode representation.   All of 
the bytes in a NumPy unicode array are either UTF-32LE or UTF-32BE (depending 
on the byte order of the dtype).    This is all pretty easy to understand as 
long as you stick with NumPy arrays only. 
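 
For example, something like this (just a sketch from memory, assuming a 
little-endian machine): 
 
    >>> import numpy as np
    >>> a = np.array([u'abc'], dtype='U3')
    >>> a.dtype.itemsize    # 4 bytes per code point, 3 code points
    12
    >>> a.tostring()        # the raw UTF-32LE bytes on this machine
    'a\x00\x00\x00b\x00\x00\x00c\x00\x00\x00'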

The difficulty starts when you interact with the unicode array scalar (which 
is exactly the same data structure as a Python unicode object, just with a 
different type name --- numpy.unicode_).    However, I had overlooked the 
"encoding" argument to the standard "unicode" constructor, which could 
simplify what we are doing.    If I understand things correctly now, all we 
need to do is "decode" the raw UTF-32LE or UTF-32BE bytes in the array 
(depending on the dtype) into a unicode object. 
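 
In other words (untested, but I believe this is right): 
 
    >>> import numpy as np
    >>> raw = np.array([u'abc'], dtype='>U3').tostring()  # UTF-32BE bytes
    >>> np.unicode_(raw, 'utf_32_be')                     # decode -> scalar
    u'abc'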

This is easily accomplished with numpy.unicode_(<bytes object>, 'utf_32_be') 
or numpy.unicode_(<bytes object>, 'utf_32_le').    There is also a 
corresponding "encode" step to go from the Python unicode object back to the 
bytes representation stored in the NumPy array.   I think this is what we 
should be doing in most places, and it should considerably simplify the 
Unicode code in NumPy --- possibly even eliminating the ucsnarrow.c file. 
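 
And the reverse direction would be something like (again, untested): 
 
    >>> import numpy as np
    >>> np.unicode_(u'abc').encode('utf_32_le')  # unicode -> raw LE bytes
    'a\x00\x00\x00b\x00\x00\x00c\x00\x00\x00'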

Am I missing something? 

Thanks, 

-Travis


 