Dave Hirschfeld created ARROW-2799:
--------------------------------------

             Summary: Table.from_pandas silently truncates data, even when passed a schema
                 Key: ARROW-2799
                 URL: https://issues.apache.org/jira/browse/ARROW-2799
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.9.0
            Reporter: Dave Hirschfeld

Ported over from [https://github.com/apache/arrow/issues/2217]

```python
In [8]: import numpy as np
   ...: import pandas as pd
   ...: import pyarrow as arw

In [9]: df = pd.DataFrame({'A': list('abc'), 'B': np.arange(3)})
   ...: df
Out[9]:
   A  B
0  a  0
1  b  1
2  c  2

In [10]: schema = arw.schema([
    ...:     arw.field('A', arw.string()),
    ...:     arw.field('B', arw.int32()),
    ...: ])

In [11]: tbl = arw.Table.from_pandas(df, preserve_index=False, schema=schema)
    ...: tbl
Out[11]:
pyarrow.Table
A: string
B: int32
metadata
--------
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
            b' "A", "field_name": "A", "pandas_type": "unicode", "numpy_type":'
            b' "object", "metadata": null}, {"name": "B", "field_name": "B", "'
            b'pandas_type": "int32", "numpy_type": "int32", "metadata": null}]'
            b', "pandas_version": "0.23.1"}'}

In [12]: tbl.to_pandas().equals(df)
Out[12]: True
```

...so if the `schema` matches the pandas datatypes, all is well: we can roundtrip the DataFrame.

Now, say we have some bad data such that column 'B' is now of type float64. The datatypes of the DataFrame no longer match the explicitly supplied `schema` object, but rather than raising a `TypeError`, the data is silently truncated. The roundtripped DataFrame doesn't match our input DataFrame, and not even a warning is raised!

```python
In [13]: df['B'].iloc[0] = 1.23
    ...: df
Out[13]:
   A     B
0  a  1.23
1  b  1.00
2  c  2.00

In [14]: # I would expect/want this to raise a TypeError since the schema doesn't match the pandas datatypes
    ...: tbl = arw.Table.from_pandas(df, preserve_index=False, schema=schema)
    ...: tbl
Out[14]:
pyarrow.Table
A: string
B: int32
metadata
--------
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
            b' "A", "field_name": "A", "pandas_type": "unicode", "numpy_type":'
            b' "object", "metadata": null}, {"name": "B", "field_name": "B", "'
            b'pandas_type": "int32", "numpy_type": "float64", "metadata": null'
            b'}], "pandas_version": "0.23.1"}'}

In [15]: tbl.to_pandas()  # <-- SILENT TRUNCATION!!!
Out[15]:
   A  B
0  a  1
1  b  1
2  c  2
```

To be clear, I would really like `Table.from_pandas` to raise a `TypeError` if the DataFrame types don't match an explicitly supplied schema, and I would hope the current behaviour is considered a bug.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
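
For illustration, the kind of check requested above might look roughly like the following. This is a minimal sketch in plain Python, not pyarrow code: `check_lossless`, `INTEGER_TYPES`, and the dict-based stand-ins for the DataFrame and the schema are all hypothetical names chosen for the example. It only flags float values that would lose their fractional part when stored in an integer column; a stricter variant could reject any column whose dtype differs from the schema.

```python
# Hypothetical helper, not part of pyarrow: the names and the dict-based
# "schema" are assumptions made purely for illustration.
INTEGER_TYPES = {"int8", "int16", "int32", "int64"}


def check_lossless(columns, schema):
    """Raise TypeError if a column would lose data when cast to its declared type.

    columns: dict mapping column name -> list of Python values
    schema:  dict mapping column name -> type name such as "int32" or "string"
    """
    for name, type_name in schema.items():
        if type_name not in INTEGER_TYPES:
            continue  # this sketch only checks integer targets
        for row, value in enumerate(columns[name]):
            # A float with a fractional part (or NaN/inf) cannot be stored
            # in an integer column without changing the value.
            if isinstance(value, float) and not value.is_integer():
                raise TypeError(
                    f"column {name!r}, row {row}: {value!r} cannot be stored "
                    f"as {type_name} without truncation"
                )


schema = {"A": "string", "B": "int32"}
good = {"A": list("abc"), "B": [0, 1, 2]}
bad = {"A": list("abc"), "B": [1.23, 1.0, 2.0]}

check_lossless(good, schema)  # passes silently

try:
    check_lossless(bad, schema)
except TypeError as exc:
    print(exc)  # reports the offending column, row, and value
```

Running a check like this before the conversion turns the silent truncation into an explicit error at the point where the bad data enters.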