[
https://issues.apache.org/jira/browse/ARROW-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626336#comment-16626336
]
David Lee edited comment on ARROW-3065 at 9/24/18 7:58 PM:
-----------------------------------------------------------
This test fails on 0.10.0 but works on 0.9.0. In one table the column does
not exist to start with and is added using DataFrame.reindex(). The reason
for doing this is that the original file(s) being converted to Parquet may or
may not contain all 100+ columns.
{quote}import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    pa.field('col1', pa.string()),
    pa.field('col2', pa.string()),
])
df1 = pd.DataFrame([{"col1": v, "col2": v} for v in list("abcdefgh")])
df2 = pd.DataFrame([{"col2": v} for v in list("abcdefgh")])
df1 = df1.reindex(columns=schema.names)
df2 = df2.reindex(columns=schema.names)
tbl1 = pa.Table.from_pandas(df1, schema=schema, preserve_index=False)
tbl2 = pa.Table.from_pandas(df2, schema=schema, preserve_index=False)
tbl3 = pa.concat_tables([tbl1, tbl2])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables
  File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Schema at index 1 was different:
{quote}
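A pandas-only sketch of what reindex() does to the missing column may help when reading the failure. It assumes the same two-column layout as the test above and does not touch pyarrow. The link to the concat_tables() error is a plausible reading, not something confirmed in this comment.

```python
import pandas as pd

# Same column layout as the test above.
columns = ["col1", "col2"]

# df1 has both columns; df2 is missing col1 until reindex() adds it.
df1 = pd.DataFrame([{"col1": v, "col2": v} for v in "abcdefgh"]).reindex(columns=columns)
df2 = pd.DataFrame([{"col2": v} for v in "abcdefgh"]).reindex(columns=columns)

# reindex() fills the new column with NaN, so pandas gives it a float dtype.
col1_dtype_df1 = str(df1["col1"].dtype)
col1_dtype_df2 = str(df2["col1"].dtype)

print("df1 col1 dtype:", col1_dtype_df1)
print("df2 col1 dtype:", col1_dtype_df2)
print("df2 col1 all null:", bool(df2["col1"].isna().all()))
```

If the pandas dtype of each frame is copied into the schema metadata, the two tables would carry different metadata even though both are built with the same Arrow schema.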
> [Python] concat_tables() failing from bad Pandas Metadata
> ---------------------------------------------------------
>
> Key: ARROW-3065
> URL: https://issues.apache.org/jira/browse/ARROW-3065
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.10.0
> Reporter: David Lee
> Priority: Major
> Fix For: 0.12.0
>
>
> It looks like the major bug from
> https://issues.apache.org/jira/browse/ARROW-1941 is back.
> After I downgraded from 0.10.0 to 0.9.0, the error disappeared.
> {code:python}
> new_arrow_table = pa.concat_tables(my_arrow_tables)
> File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables
> File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Schema at index 2 was different:
> {code}
> In order to debug this I saved the first 4 arrow tables to 4 parquet files
> and inspected the parquet files. The parquet schema is identical, but the
> Pandas Metadata is different.
> {code:python}
> # Write the first 4 tables to test0.parquet .. test3.parquet
> for i in range(4):
>     pq.write_table(my_arrow_tables[i], "test" + str(i) + ".parquet")
> {code}
> It looks like a column which contains only empty strings is recorded with
> numpy_type float64 in the Pandas Metadata, even though its Arrow type is
> still string.
> {code:python}
> >>> test1.schema
> HoldingDetail_Id: string
> metadata
> --------
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [
> {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type":
> "unicode", "numpy_type": "object", "metadata": null},
> >>> test1[0]
> <Column name='HoldingDetail_Id' type=DataType(string)>
> [
> [
> "Z4",
> "SF",
> "J7",
> "W6",
> "L7",
> "Q9",
> "NE",
> "F7",
> >>> test2.schema
> HoldingDetail_Id: string
> metadata
> --------
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [
> {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type":
> "unicode", "numpy_type": "float64", "metadata": null},
> >>> test2[0]
> <Column name='HoldingDetail_Id' type=DataType(string)>
> [
> [
> "",
> "",
> "",
> "",
> "",
> "",
> "",
> "",
> {code}
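The comparison described above can be automated by parsing the JSON stored under the b'pandas' metadata key. The sketch below uses only the standard library. The helper names are hypothetical, and the two sample blobs are trimmed, single-column illustrations modelled on the output above, not the full metadata from the reporter's files.

```python
import json


def numpy_type_by_column(pandas_meta: bytes) -> dict:
    """Map column name -> numpy_type from a b'pandas' metadata value."""
    meta = json.loads(pandas_meta.decode("utf-8"))
    return {col["name"]: col["numpy_type"] for col in meta["columns"]}


def diff_numpy_types(meta_a: bytes, meta_b: bytes) -> dict:
    """Return {column: (type_in_a, type_in_b)} for columns whose numpy_type differs."""
    types_a = numpy_type_by_column(meta_a)
    types_b = numpy_type_by_column(meta_b)
    return {
        name: (types_a[name], types_b[name])
        for name in types_a
        if name in types_b and types_a[name] != types_b[name]
    }


# Illustrative blobs: one column each, closed off so they parse as JSON.
meta1 = (
    b'{"index_columns": [], "column_indexes": [], "columns": ['
    b'{"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", '
    b'"pandas_type": "unicode", "numpy_type": "object", "metadata": null}]}'
)
meta2 = (
    b'{"index_columns": [], "column_indexes": [], "columns": ['
    b'{"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", '
    b'"pandas_type": "unicode", "numpy_type": "float64", "metadata": null}]}'
)

mismatches = diff_numpy_types(meta1, meta2)
print(mismatches)  # {'HoldingDetail_Id': ('object', 'float64')}
```

With real tables, the blob would be read from the schema metadata (for example table.schema.metadata[b'pandas']) rather than typed in by hand.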