[jira] [Comment Edited] (ARROW-1633) [Python] numpy "unicode" arrays not understood

Wes McKinney (JIRA) Wed, 04 Oct 2017 15:54:38 -0700

    [ 
https://issues.apache.org/jira/browse/ARROW-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192175#comment-16192175
 ]


Wes McKinney edited comment on ARROW-1633 at 10/4/17 10:53 PM:
---------------------------------------------------------------

Well so after some spelunking, I discovered these facts:

* Unicode in NumPy is always UCS4 
https://github.com/numpy/numpy/blob/c90d7c94fd2077d0beca48fa89a423da2b0bb663/numpy/core/src/multiarray/ucsnarrow.c#L20
* UCS4 and UTF-32 are in practice the same thing (see "History" section in 
https://en.wikipedia.org/wiki/UTF-32, I don't know why this had to be so 
complicated, but the first rule of Unicode is that it is not simple)
* codecvt, while part of the C++11 standard, is not supported in gcc 4.8 (our 
minimum version), so that stinks. So at some point in the future we can use the 
C++ standard library for converting things to UTF8 but that day is not today
* Python 2.7 has UTF32 codecs we can use for converting from UTF32 to UTF8 
https://docs.python.org/2.7/c-api/unicode.html#utf-32-codecs. 
* Python 3.6 seems to have the same UTF32 API 
https://docs.python.org/3.6/c-api/unicode.html#utf-32-codecs
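
To make the first two points concrete, here is a minimal sketch in pure Python (not the C API, and not pyarrow code). It assumes a 1-D NumPy 'U' array and shows that each element's raw buffer decodes as UTF-32 and can then be re-encoded as UTF-8. The helper names utf32_codec and item_to_utf8 are made up for illustration.

```python
import sys

import numpy as np


def utf32_codec(dtype):
    """Pick the UTF-32 codec matching the byte order of a NumPy 'U' dtype."""
    order = dtype.byteorder
    if order in ("=", "|"):
        # '=' means native order; resolve it from the running interpreter.
        order = "<" if sys.byteorder == "little" else ">"
    return "utf-32-le" if order == "<" else "utf-32-be"


def item_to_utf8(arr, i):
    """Convert element i of a 1-D 'U' array to UTF-8 via its raw UCS4 bytes."""
    raw = arr[i:i + 1].tobytes()  # itemsize bytes, 4 per code point
    text = raw.decode(utf32_codec(arr.dtype))
    # NumPy pads short strings with NUL code points; strip that padding.
    return text.rstrip("\x00").encode("utf-8")


# dtype 'U5' reserves 5 code points per element, i.e. 20 bytes.
arr = np.array(["héllo", "日本", "a"], dtype="U5")
utf8_items = [item_to_utf8(arr, i) for i in range(len(arr))]
```

Stripping trailing NULs mirrors what NumPy itself does when it hands elements back as Python strings, so a string that really ends in NUL characters cannot be told apart from padding.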

We aren't supporting ASCII/fixed-size binary types from NumPy yet, so we should 
probably handle these at the same time. We can use the (slow) Python C APIs for 
now to get UTF8, and maybe replace these with something native in C/C++ if 
someone cares enough about perf to do the work.
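
As a rough illustration of the "slow" route, sketched at the Python level rather than with the C API: the hypothetical helper below (to_utf8_list is not a pyarrow function) treats 'U' and 'S' arrays together by going through Python objects. It assumes that 'S' data already holds ASCII or UTF-8 bytes, which NumPy does not guarantee.

```python
import numpy as np


def to_utf8_list(arr):
    """Return each element of a NumPy 'U' or 'S' array as UTF-8 bytes."""
    kind = arr.dtype.kind
    flat = arr.ravel().tolist()  # builds one Python object per element
    if kind == "U":
        # Elements come back as Python str; encode each one as UTF-8.
        return [s.encode("utf-8") for s in flat]
    if kind == "S":
        # Fixed-size byte strings; assumed to be ASCII/UTF-8 already.
        return flat
    raise TypeError("expected a 'U' or 'S' array, got kind %r" % kind)


unicode_items = to_utf8_list(np.array(["é", "abc"]))
binary_items = to_utf8_list(np.array([b"ab", b"c"]))
```

Since the report notes that object arrays already work, converting with arr.astype(object) before handing the data to pyarrow may serve as a workaround in the meantime.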


> [Python] numpy "unicode" arrays not understood
> ----------------------------------------------
>
>                 Key: ARROW-1633
>                 URL: https://issues.apache.org/jira/browse/ARROW-1633
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Nick White
>             Fix For: 0.8.0
>
>
> {code}
> import numpy as np
> pa.StringArray.from_pandas(np.empty(1, np.unicode))
> {code}
> Throws:
> {noformat}
> ---------------------------------------------------------------------------
> ArrowNotImplementedError                  Traceback (most recent call last)
> <ipython-input-68-f9bc946f2c0a> in <module>()
>       1 import numpy as np
> ----> 2 pa.StringArray.from_pandas(np.empty(1, np.unicode))
> array.pxi in pyarrow.lib.Array.from_pandas()
> error.pxi in pyarrow.lib.check_status()
> ArrowNotImplementedError: Unsupported numpy type 19
> {noformat}
> np.object arrays work, though...



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
