[jira] [Commented] (ARROW-2298) [Python] Add option to not consider NaN to be null when converting to an integer Arrow type

Joris Van den Bossche (JIRA) Fri, 07 Jun 2019 04:52:25 -0700


    [ https://issues.apache.org/jira/browse/ARROW-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858547#comment-16858547 ]


Joris Van den Bossche commented on ARROW-2298:
----------------------------------------------

[~farnoy] For me, the example you show above works:
{code}
In [33]: schema = pa.schema([pa.field(name='a', type=pa.int64(), nullable=True)])

In [34]: pa.Table.from_pandas(df, schema=schema, preserve_index=False)
Out[34]: 
pyarrow.Table
a: int64
metadata
--------
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
            b' "a", "field_name": "a", "pandas_type": "int64", "numpy_type": "'
            b'float64", "metadata": null}], "creator": {"library": "pyarrow", '
            b'"version": "0.13.1.dev313+g997226a9"}, "pandas_version": "0.24.2'
            b'"}'}

In [35]: table = _

In [36]: table.column('a')
Out[36]: 
<Column name='a' type=DataType(int64)>
[
  [
    null,
    1,
    2,
    3,
    null
  ]
]
{code}

This is because in {{Table.from_pandas}} we assume the data come from pandas, 
so NaN is treated as null and the float column can be converted to the 
requested integer type.

You can see the same behaviour with the plain array API when converting a 
float NumPy array to an integer Arrow array: without {{from_pandas=True}} the 
conversion fails, and with it NaN becomes null:

{code:python}
In [41]: pa.array(np.array([1, 2, np.nan], dtype=float), type=pa.int64())
...
ArrowInvalid: Floating point value truncated

In [42]: pa.array(np.array([1, 2, np.nan], dtype=float), type=pa.int64(), from_pandas=True)
Out[42]: 
<pyarrow.lib.Int64Array object at 0x7feaeea36548>
[
  1,
  2,
  null
]
{code}

Does that satisfy your use case? 

It might not help for very big integers that cannot be represented exactly 
as floats (those will still raise an error about values being truncated), but 
I think that if you are coming from pandas, that use case will not be very 
frequent, precisely because pandas cannot represent such values properly itself.
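
To illustrate the float limitation mentioned above, here is a minimal sketch 
in plain Python (no pyarrow involved; the variable names are only 
illustrative). It assumes the usual float64 representation, which has a 
53-bit significand, so not every integer above 2**53 survives a round trip 
through a float:

```python
# float64 has a 53-bit significand, so integers above 2**53
# are not all exactly representable.
exact_limit = 2**53        # 9007199254740992
big = exact_limit + 1      # 9007199254740993, just above the limit

as_float = float(big)      # rounds to the nearest representable float64
round_trip = int(as_float)

print(big, round_trip, big == round_trip)
# prints: 9007199254740993 9007199254740992 False
```

So an integer column holding such a value has already lost precision once it 
has been stored as float64 (for example to hold NaN next to it), before any 
conversion to Arrow takes place.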

> [Python] Add option to not consider NaN to be null when converting to an 
> integer Arrow type
> -------------------------------------------------------------------------------------------
>
>                 Key: ARROW-2298
>                 URL: https://issues.apache.org/jira/browse/ARROW-2298
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Wes McKinney
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.14.0
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Follow-on work to ARROW-2135



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
