[
https://issues.apache.org/jira/browse/ARROW-374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Uwe L. Korn resolved ARROW-374.
-------------------------------
Resolution: Fixed
Issue resolved by pull request 249
[https://github.com/apache/arrow/pull/249]
> Python: clarify unicode vs. binary in API
> -----------------------------------------
>
> Key: ARROW-374
> URL: https://issues.apache.org/jira/browse/ARROW-374
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.1.0
> Reporter: Jochen Ott
> Assignee: Wes McKinney
> Priority: Minor
>
> pyarrow supports arrow's String type, arrow-internally represented as
> BINARY+UTF8 annotation.
> In python 2, the pyarrow API accepts both {{unicode}} and binary strings
> ({{str}}), where the latter are assumed to be utf-8 encoded. I find this
> approach problematic, because:
> * there is an implicit assumption that a binary {{str}} contains valid utf-8
> data. This assumption can be wrong, however, and it's not clear what the
> consequences of passing such "invalid data" to the API are.
> * the utf-8 assumption is not clearly documented or otherwise visible from
> the API
> * if pyarrow wants to support pure binary data in the future, a natural
> choice would be to use {{str}} as python2 type. However, this would conflict
> with the current interpretation of binary {{str}} as BINARY+UTF8
> *Proposed solution*
> I propose to change the API so that it only accepts or returns unicode strings,
> i.e. python2's {{unicode}} and python3's {{str}}. Passing a python2 {{str}}
> should raise an exception, same for python3's {{bytes}}.
> If at some point in the future raw BINARY is also supported, use python3's
> {{bytes}} and python2's {{str}}.
> As a convenience feature for API users, the API may also allow passing utf-8
> encoded binary data as arrow's String, but that should be an explicit, opt-in
> choice, so that API users are aware of the (encoding-)assumptions made.
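
For illustration only, the behaviour proposed in the issue could look roughly like the sketch below. It is written for python3 ({{str}} / {{bytes}}), and both the helper name to_arrow_string and the assume_utf8 flag are hypothetical: they are not part of pyarrow and are not taken from pull request 249.

```python
def to_arrow_string(value, *, assume_utf8=False):
    """Hypothetical helper (not a pyarrow API): return text fit for a String column.

    Unicode text (python3 str) is accepted as-is. Binary data (python3 bytes)
    is rejected unless the caller explicitly opts in with assume_utf8=True.
    """
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        if not assume_utf8:
            raise TypeError(
                "expected str, got bytes; pass assume_utf8=True to decode as UTF-8"
            )
        # Strict decoding: invalid UTF-8 raises UnicodeDecodeError instead of
        # being stored silently.
        return value.decode("utf-8")
    raise TypeError("expected str, got %s" % type(value).__name__)


# Unicode passes through unchanged; bytes need the explicit opt-in.
text = to_arrow_string("caf\u00e9")
decoded = to_arrow_string("caf\u00e9".encode("utf-8"), assume_utf8=True)
```

With the opt-in flag the encoding assumption is visible at the call site, and input that is not valid utf-8 fails loudly at conversion time.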
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)