[
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438721#comment-16438721
]
ASF GitHub Bot commented on ARROW-2101:
---------------------------------------
joshuastorck commented on issue #1886: ARROW-2101: [Python/C++] Correctly
convert numpy arrays of bytes to arrow arrays of strings when user specifies
arrow type of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-381410530
@pitrou, on second look, moving the check outside of AppendObjectStrings
won't be more efficient. When check_valid is passed to AppendObjectStrings,
the UTF-8 decoding/check only happens if the data is Python 3 bytes or
Python 2 str. If the user passes Python 3 str or Python 2 unicode and wants
a string type, no extra checks are done. When the user wants the output type
to be an Arrow string, we need to run the check on each bytes object.
Otherwise, we will return a StringArray whose data is not actually UTF-8.
Please let me know if that makes sense, and if not, let me know how you
would make it faster.
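
To make the argument concrete, here is a rough pure-Python sketch of the
per-element logic described above. It is not the actual C++ code: the function
name append_object_strings and its signature are hypothetical stand-ins, and
only the check_valid flag mirrors the comment. It assumes Python 3, where
bytes may hold arbitrary data and str is text.

```python
def append_object_strings(values, check_valid):
    """Hypothetical stand-in for the per-element loop (Python 3 semantics).

    Returns the raw byte payloads that would be appended to the array.
    """
    out = []
    for obj in values:
        if obj is None:
            # Nulls carry no payload, so there is nothing to validate.
            out.append(None)
        elif isinstance(obj, str):
            # Text objects are simply encoded; no separate validity
            # check is run on the resulting bytes.
            out.append(obj.encode("utf-8"))
        elif isinstance(obj, bytes):
            # Raw bytes are only validated when the caller asked for a
            # string type (check_valid=True).
            if check_valid:
                obj.decode("utf-8")  # raises UnicodeDecodeError if invalid
            out.append(obj)
        else:
            raise TypeError("expected str, bytes or None")
    return out
```

With check_valid=True, an element such as b"\xff" raises UnicodeDecodeError.
With check_valid=False the same bytes are accepted unchanged, which is only
safe when the output type is binary. Because the decision depends on the type
of each element, the check has to stay inside the loop.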
> [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
> ------------------------------------------------------------------------
>
> Key: ARROW-2101
> URL: https://issues.apache.org/jira/browse/ARROW-2101
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.8.0
> Reporter: Bryan Cutler
> Assignee: Bryan Cutler
> Priority: Major
> Labels: pull-request-available
>
> Using Python 2, converting a pandas Series with 'str' data to Arrow results
> in Arrow data of binary type, even if the user supplies type information.
> Converting 'unicode' data correctly produces Arrow data of string type. For
> example:
> {code}
> In [25]: pa.Array.from_pandas(pd.Series(['a'])).type
> Out[25]: DataType(binary)
> In [26]: pa.Array.from_pandas(pd.Series(['a']), type=pa.string()).type
> Out[26]: DataType(binary)
> In [27]: pa.Array.from_pandas(pd.Series([u'a'])).type
> Out[27]: DataType(string)
> {code}