[ https://issues.apache.org/jira/browse/ARROW-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898787#comment-16898787 ]

Joris Van den Bossche commented on ARROW-5682:
----------------------------------------------

This seems to be specific to the code paths dealing with numpy arrays; from 
built-in python objects, you get a proper type error instead:

{code}
In [9]: pa.array([1, 2, 3], pa.string())
...
ArrowTypeError: Expected a string or bytes object, got a 'int' object

In [10]: pa.array(np.array([1, 2, 3]), pa.string()) 
Out[10]: 
<pyarrow.lib.StringArray object at 0x7f0909c902b0>
[
  "",   # <-- this is actually not an empty string but '\x01'
  "",
  ""
]
{code}

I agree that, at a minimum, an error should be raised instead of producing those 
incorrect values.

In numpy you can cast ints to their string representation with an equivalent 
call:

{code}
In [13]: np.array(np.array([1, 2, 3], dtype=int), dtype=str)
Out[13]: array(['1', '2', '3'], dtype='<U21')
{code}
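As a user-side workaround, the same cast can be done in numpy before handing the data to pyarrow. A minimal sketch (the final {{pa.array}} call is left as a comment, since its behavior with this input is exactly what this issue is about and varies by version):

```python
import numpy as np

# Cast the integers to their string representation in numpy first.
ints = np.array([1, 2, 3], dtype=np.int64)
strings = ints.astype(str)  # fixed-width unicode dtype, e.g. '<U21' for int64

# The resulting string array can then be passed to pyarrow unambiguously:
# pa.array(strings, pa.string())  ->  ["1", "2", "3"]
```

The '<U21' width comes from numpy sizing the unicode dtype to hold the longest possible int64 value, so no truncation can occur.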

Not sure whether we should do something similar, though (and it is certainly not 
a priority, I would say).

> [Python] from_pandas conversion casts values to string inconsistently
> ---------------------------------------------------------------------
>
>                 Key: ARROW-5682
>                 URL: https://issues.apache.org/jira/browse/ARROW-5682
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.13.0
>            Reporter: Bryan Cutler
>            Priority: Minor
>
> When calling {{pa.Array.from_pandas}} with primitive data as input and casting 
> to string with {{type=pa.string()}}, the resulting pyarrow Array can have 
> inconsistent values. For most input types the result is an empty string, but 
> for some types (int32, int64) the values are '\x01' etc.
> {noformat}
> In [8]: s = pd.Series([1, 2, 3], dtype=np.uint8)
> In [9]: pa.Array.from_pandas(s, type=pa.string())
> Out[9]: 
> <pyarrow.lib.StringArray object at 0x7f90b6091a48>
> [
>   "",
>   "",
>   ""
> ]
> In [10]: s = pd.Series([1, 2, 3], dtype=np.uint32)
> In [11]: pa.Array.from_pandas(s, type=pa.string())
> Out[11]: 
> <pyarrow.lib.StringArray object at 0x7f9097efca48>
> [
>   "",
>   "",
>   ""
> ]
> {noformat}
> This came from the Spark discussion 
> https://github.com/apache/spark/pull/24930/files#r296187903. Type casting 
> this way is not supported in Spark, but it would be good to make the behavior 
> consistent. Would it be better to raise an UnsupportedOperation error?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
