[ 
https://issues.apache.org/jira/browse/ARROW-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192175#comment-16192175
 ] 

Wes McKinney commented on ARROW-1633:
-------------------------------------

Well so after some spelunking, I discovered these facts:

* Unicode in NumPy is always UCS4 
https://github.com/numpy/numpy/blob/c90d7c94fd2077d0beca48fa89a423da2b0bb663/numpy/core/src/multiarray/ucsnarrow.c#L20
* UCS4 and UTF-32 are in practice the same thing (see "History" section in 
https://en.wikipedia.org/wiki/UTF-32, I don't know why this had to be so 
complicated, but the first rule of Unicode is that it is not simple)
* codecvt, while part of the C++11 standard, is not supported in gcc 4.8 (our 
minimum version), so that stinks. At some point in the future we can use the 
C++ standard library for converting things to UTF-8, but that day is not today
* Python 2.7 has UTF32 codecs we can use for converting from UTF32 to UTF8 
https://docs.python.org/2.7/c-api/unicode.html#utf-32-codecs
* Python 3.6 seems to have the same UTF32 API 
https://docs.python.org/3.6/c-api/unicode.html#utf-32-codecs
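
To make the UCS4-to-UTF-8 step concrete, here is a rough pure-Python sketch of the conversion the bullets above describe. It is not the Arrow implementation: the helper name numpy_unicode_to_utf8 is made up for illustration, and it uses Python's built-in "utf-32-le"/"utf-32-be" codecs as a stand-in for the C-level UTF-32 codec functions linked above. It assumes each element of a NumPy "U" array occupies a fixed 4 * N bytes, padded with NUL code points.

```python
import sys

import numpy as np


def numpy_unicode_to_utf8(arr):
    """Return one UTF-8 byte string per element of a NumPy 'U' array.

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    if arr.dtype.kind != "U":
        raise TypeError("expected a NumPy unicode ('U') array")

    # dtype.byteorder is '=' for native order, otherwise '<' or '>'.
    order = arr.dtype.byteorder
    if order not in ("<", ">"):
        order = "<" if sys.byteorder == "little" else ">"
    codec = "utf-32-le" if order == "<" else "utf-32-be"

    itemsize = arr.dtype.itemsize  # 4 bytes per code point
    raw = arr.ravel().tobytes()    # raw fixed-width UCS4 data

    out = []
    for i in range(arr.size):
        chunk = raw[i * itemsize:(i + 1) * itemsize]
        # Short strings are padded with NUL code points; strip the padding.
        text = chunk.decode(codec).rstrip("\x00")
        out.append(text.encode("utf-8"))
    return out


values = np.array(["héllo", "日本", ""], dtype="U5")
utf8_values = numpy_unicode_to_utf8(values)
```

Going through Python objects like this is the slow path; a native C/C++ transcoder would avoid creating an intermediate string per element.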

We aren't supporting ASCII/fixed-size binary types from NumPy yet, so we should 
probably handle these at the same time. We can use the (slow) Python C APIs for 
now to get UTF-8 and maybe replace these with something native in C/C++ if 
someone cares enough about perf to do the work.
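
For the fixed-size binary case, the layout is similar but simpler: NumPy "S" arrays store each element in a fixed number of bytes, padded with trailing NULs. A minimal sketch of unpacking them, with the hypothetical helper name fixed_width_bytes_to_list (again, not pyarrow code):

```python
import numpy as np


def fixed_width_bytes_to_list(arr):
    """Return one bytes object per element of a NumPy 'S' array.

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    if arr.dtype.kind != "S":
        raise TypeError("expected a NumPy bytes ('S') array")

    itemsize = arr.dtype.itemsize  # fixed width in bytes
    raw = arr.ravel().tobytes()    # raw fixed-width data

    # Strip the NUL padding NumPy adds to values shorter than itemsize.
    return [
        raw[i * itemsize:(i + 1) * itemsize].rstrip(b"\x00")
        for i in range(arr.size)
    ]


packed = np.array([b"ab", b"abcd", b""], dtype="S4")
unpacked = fixed_width_bytes_to_list(packed)
```

Note that stripping NULs means values that legitimately end in NUL bytes cannot be told apart from padding, which is how NumPy itself treats "S" data.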

> [Python] numpy "unicode" arrays not understood
> ----------------------------------------------
>
>                 Key: ARROW-1633
>                 URL: https://issues.apache.org/jira/browse/ARROW-1633
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Nick White
>             Fix For: 0.8.0
>
>
> {code}
> import numpy as np
> pa.StringArray.from_pandas(np.empty(1, np.unicode))
> {code}
> Throws:
> {noformat}
> ---------------------------------------------------------------------------
> ArrowNotImplementedError                  Traceback (most recent call last)
> <ipython-input-68-f9bc946f2c0a> in <module>()
>       1 import numpy as np
> ----> 2 pa.StringArray.from_pandas(np.empty(1, np.unicode))
> array.pxi in pyarrow.lib.Array.from_pandas()
> error.pxi in pyarrow.lib.check_status()
> ArrowNotImplementedError: Unsupported numpy type 19
> {noformat}
> np.object arrays work, though...



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
