ASF GitHub Bot commented on ARROW-2101:

joshuastorck commented on issue #1886: ARROW-2101: [Python/C++] Correctly 
convert numpy arrays of bytes to arrow arrays of strings when user specifies 
arrow type of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-381410530
   @pitrou, on second look, moving the check outside of AppendObjectStrings 
won't be more efficient. When check_valid is passed to AppendObjectStrings, 
the UTF-8 decoding/check only happens if the data is Python 3 bytes or 
Python 2 strings. If the user passes Python 3 strings or Python 2 unicode and 
wants a string type, no extra checks are done. When the user wants the output 
type to be an arrow string, we need to do the check on each bytes object. 
Otherwise, we would return a StringArray whose data is not actually UTF-8.
   Please let me know if that makes sense, and if not, how you would make it 
faster.
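
   To illustrate the per-element behaviour described above, here is a rough 
pure-Python sketch. It is not the actual C++ code in the PR: the function name 
append_object_strings and its list-based return value are hypothetical 
stand-ins for AppendObjectStrings and the array builder, and only the 
check_valid flag comes from the discussion. It assumes Python 3 semantics 
(str is text, bytes is raw data).

```python
def append_object_strings(values, check_valid):
    """Hypothetical analogue of the per-element logic (not Arrow's real API).

    Returns a list of UTF-8 encoded bytes objects (or None for nulls).
    """
    out = []
    for obj in values:
        if obj is None:
            # Nulls are passed through untouched.
            out.append(None)
        elif isinstance(obj, str):
            # Text objects: encoding either yields valid UTF-8 or raises,
            # so no separate validity check is needed on this path.
            out.append(obj.encode("utf-8"))
        elif isinstance(obj, bytes):
            # Raw bytes: only validated when the caller asked for a string
            # type, i.e. when check_valid is True.
            if check_valid:
                obj.decode("utf-8")  # raises UnicodeDecodeError if invalid
            out.append(obj)
        else:
            raise TypeError("expected str, bytes or None, got %r" % type(obj))
    return out
```

   In this sketch the extra cost of check_valid is paid only on the bytes 
branch, which is why hoisting the check out of the loop would not save work 
for inputs that are already text.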

> [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
> ------------------------------------------------------------------------
>                 Key: ARROW-2101
>                 URL: https://issues.apache.org/jira/browse/ARROW-2101
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0
>            Reporter: Bryan Cutler
>            Assignee: Bryan Cutler
>            Priority: Major
>              Labels: pull-request-available
> Using Python 2, converting Pandas data of 'str' type to Arrow results in Arrow 
> data of binary type, even if the user supplies type information.  Conversion 
> of 'unicode' type correctly creates Arrow data of string type.  For example:
> {code}
> In [25]: pa.Array.from_pandas(pd.Series(['a'])).type
> Out[25]: DataType(binary)
> In [26]: pa.Array.from_pandas(pd.Series(['a']), type=pa.string()).type
> Out[26]: DataType(binary)
> In [27]: pa.Array.from_pandas(pd.Series([u'a'])).type
> Out[27]: DataType(string)
> {code}
