[ https://issues.apache.org/jira/browse/ARROW-2799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Krisztian Szucs updated ARROW-2799:
-----------------------------------
    Summary: [C++/Python] Add safe option to Table.from_pandas to avoid unsafe casts  (was: [Python] Add safe option to Table.from_pandas to avoid unsafe casts)

> [C++/Python] Add safe option to Table.from_pandas to avoid unsafe casts
> -----------------------------------------------------------------------
>
>                 Key: ARROW-2799
>                 URL: https://issues.apache.org/jira/browse/ARROW-2799
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 0.9.0
>            Reporter: Dave Hirschfeld
>            Assignee: Krisztian Szucs
>            Priority: Major
>             Fix For: 0.11.0
>
>
> Ported over from [https://github.com/apache/arrow/issues/2217]
> ```python
> In [8]: import numpy as np
>    ...: import pandas as pd
>    ...: import pyarrow as arw
> In [9]: df = pd.DataFrame({'A': list('abc'), 'B': np.arange(3)})
>    ...: df
> Out[9]:
>    A  B
> 0  a  0
> 1  b  1
> 2  c  2
> In [10]: schema = arw.schema([
>     ...:     arw.field('A', arw.string()),
>     ...:     arw.field('B', arw.int32()),
>     ...: ])
> In [11]: tbl = arw.Table.from_pandas(df, preserve_index=False, schema=schema)
>     ...: tbl
> Out[11]:
> pyarrow.Table
> A: string
> B: int32
> metadata
> --------
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
>             b' "A", "field_name": "A", "pandas_type": "unicode", "numpy_type":'
>             b' "object", "metadata": null}, {"name": "B", "field_name": "B", "'
>             b'pandas_type": "int32", "numpy_type": "int32", "metadata": null}]'
>             b', "pandas_version": "0.23.1"}'}
> In [12]: tbl.to_pandas().equals(df)
> Out[12]: True
> ```
> ...so if the `schema` matches the pandas datatypes, all is well: we can 
> roundtrip the DataFrame.
> Now, say we have some bad data such that column 'B' is now of type float64. 
> The datatypes of the DataFrame no longer match the explicitly supplied 
> `schema` object, but rather than raising a `TypeError`, the data is silently 
> truncated and the roundtripped DataFrame doesn't match our input DataFrame, 
> without even a warning being raised!
> ```python
> In [13]: df['B'].iloc[0] = 1.23
>     ...: df
> Out[13]:
>    A     B
> 0  a  1.23
> 1  b  1.00
> 2  c  2.00
> In [14]: # I would expect/want this to raise a TypeError since the schema doesn't match the pandas datatypes
>     ...: tbl = arw.Table.from_pandas(df, preserve_index=False, schema=schema)
>     ...: tbl
> Out[14]:
> pyarrow.Table
> A: string
> B: int32
> metadata
> --------
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
>             b' "A", "field_name": "A", "pandas_type": "unicode", "numpy_type":'
>             b' "object", "metadata": null}, {"name": "B", "field_name": "B", "'
>             b'pandas_type": "int32", "numpy_type": "float64", "metadata": null'
>             b'}], "pandas_version": "0.23.1"}'}
> In [15]: tbl.to_pandas()  # <-- SILENT TRUNCATION!!!
> Out[15]:
>    A  B
> 0  a  1
> 1  b  1
> 2  c  2
> ```
> To be clear, I would really like `Table.from_pandas` to raise a `TypeError` 
> if the DataFrame types don't match an explicitly supplied schema and would 
> hope this current behaviour would be considered a bug.
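
The check requested above can be illustrated with a minimal, library-independent sketch. It is not pyarrow's implementation, and `checked_int_cast` is a hypothetical helper name invented for this example; it only shows the idea of refusing a float-to-integer cast that would lose information instead of truncating silently.

```python
def checked_int_cast(values, bits=32):
    """Cast each value to a signed integer of the given width.

    Hypothetical helper (not part of pyarrow): raises TypeError instead of
    silently truncating a fractional part or overflowing the target width.
    """
    lowest = -(2 ** (bits - 1))
    highest = 2 ** (bits - 1) - 1
    result = []
    for position, value in enumerate(values):
        try:
            as_int = int(value)
        except (ValueError, OverflowError):
            # NaN and infinities have no integer representation.
            raise TypeError(
                f"value {value!r} at position {position} is not an integer"
            )
        if as_int != value:
            # The cast dropped a fractional part, e.g. 1.23 -> 1.
            raise TypeError(
                f"value {value!r} at position {position} would be truncated"
            )
        if not lowest <= as_int <= highest:
            raise TypeError(
                f"value {value!r} at position {position} overflows int{bits}"
            )
        result.append(as_int)
    return result


# Values that are whole numbers convert cleanly.
clean = checked_int_cast([0.0, 1.0, 2.0])

# The 'bad data' column from the report is rejected rather than truncated.
try:
    checked_int_cast([1.23, 1.0, 2.0])
    rejected = False
except TypeError:
    rejected = True
```

Under this sketch, the column `[1.23, 1.0, 2.0]` from the report is rejected up front, while the original `[0, 1, 2]` data still converts without complaint.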



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
