[
https://issues.apache.org/jira/browse/ARROW-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192175#comment-16192175
]
Wes McKinney edited comment on ARROW-1633 at 10/4/17 10:53 PM:
---------------------------------------------------------------
Well so after some spelunking, I discovered these facts:
* Unicode in NumPy is always UCS4
https://github.com/numpy/numpy/blob/c90d7c94fd2077d0beca48fa89a423da2b0bb663/numpy/core/src/multiarray/ucsnarrow.c#L20
* UCS4 and UTF-32 are in practice the same thing (see "History" section in
https://en.wikipedia.org/wiki/UTF-32, I don't know why this had to be so
complicated, but the first rule of Unicode is that it is not simple)
* codecvt, while part of the C++11 standard, is not supported in gcc 4.8 (our
minimum version), so that stinks. So at some point in the future we can use the
C++ standard library for converting things to UTF8 but that day is not today
* Python 2.7 has UTF32 codecs we can use for converting from UTF32 to UTF8
https://docs.python.org/2.7/c-api/unicode.html#utf-32-codecs.
* Python 3.6 seems to have the same UTF32 API
https://docs.python.org/3.6/c-api/unicode.html#utf-32-codecs
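To make the first two points concrete, here is a minimal sketch (Python 3 assumed; the array contents are made up for illustration) showing that a NumPy unicode array stores 4 bytes per code point and that its raw buffer decodes cleanly with the UTF-32 codec matching the machine's byte order. It uses the Python-level codec rather than the C API functions linked above.

```python
import sys
import numpy as np

# Illustrative data: "héllo" (5 code points) and two CJK characters.
arr = np.array(["h\u00e9llo", "\u65e5\u672c"], dtype="U5")

# Each element is fixed width: 4 bytes per code point (UCS4), NUL padded.
assert arr.dtype.itemsize == 5 * 4

# The buffer is in native byte order, so pick the matching UTF-32 codec.
codec = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"

raw = arr.tobytes()
width = arr.dtype.itemsize

# Decode each fixed-width slot as UTF-32 and drop the NUL padding.
first = raw[:width].decode(codec).rstrip("\x00")
second = raw[width:].decode(codec).rstrip("\x00")

# Re-encoding as UTF-8 yields the variable-width bytes Arrow strings use.
first_utf8 = first.encode("utf-8")
```

Note that stripping trailing NULs means a value that genuinely ends in U+0000 cannot be told apart from padding, which matches how NumPy itself treats these arrays.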
We aren't supporting ASCII/fixed size binary types from NumPy yet, so we should
probably handle these at the same time. We can use the (slow) Python C APIs for
now to get UTF8 and maybe replace these with something natively C/C++ if
someone cares enough about perf to do the work.
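As a rough Python-level picture of what handling both kinds at once could look like, here is a sketch under some assumptions: Python 3, and a hypothetical helper name `fixed_width_to_utf8` that is not part of pyarrow. It goes element by element through Python objects, so it is the slow path, not the eventual C/C++ one.

```python
import numpy as np


def fixed_width_to_utf8(arr):
    """Hypothetical helper: return each element as a bytes object.

    Unicode ('U') elements are encoded to UTF-8; fixed-size bytes ('S')
    elements are passed through unchanged.
    """
    if arr.dtype.kind == "U":
        # Iteration yields numpy.str_ values with NUL padding removed.
        return [str(x).encode("utf-8") for x in arr.ravel()]
    if arr.dtype.kind == "S":
        # Iteration yields numpy.bytes_ values with NUL padding removed.
        return [bytes(x) for x in arr.ravel()]
    raise TypeError("expected a 'U' or 'S' dtype, got %s" % arr.dtype)


unicode_values = fixed_width_to_utf8(np.array(["a", "\u00e9"]))
binary_values = fixed_width_to_utf8(np.array([b"ab", b"c"]))
```

The per-element round trip through Python objects is the cost being traded for simplicity here.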
> [Python] numpy "unicode" arrays not understood
> ----------------------------------------------
>
> Key: ARROW-1633
> URL: https://issues.apache.org/jira/browse/ARROW-1633
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Nick White
> Fix For: 0.8.0
>
>
> {code}
> import numpy as np
> import pyarrow as pa
> pa.StringArray.from_pandas(np.empty(1, np.unicode))
> {code}
> Throws:
> {noformat}
> ---------------------------------------------------------------------------
> ArrowNotImplementedError Traceback (most recent call last)
> <ipython-input-68-f9bc946f2c0a> in <module>()
> 1 import numpy as np
> ----> 2 pa.StringArray.from_pandas(np.empty(1, np.unicode))
> array.pxi in pyarrow.lib.Array.from_pandas()
> error.pxi in pyarrow.lib.check_status()
> ArrowNotImplementedError: Unsupported numpy type 19
> {noformat}
> np.object arrays work, though...