[ https://issues.apache.org/jira/browse/ARROW-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
rob updated ARROW-2814:
-----------------------
Description:
Calling pa.Table.from_pandas() fails on data read from a Parquet file that has
a JSON string in it. The file attached to this ticket reproduces the problem,
and the code below triggers the error.
## Reproducible code
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
pd.options.display.max_colwidth = 10000
# Read the attached file and convert it to a pandas DataFrame
pq_table = pq.read_table("part-00000-8f03690f-736d-43a9-9287-6db9e228d59c.c000.gz.parquet")
panda_table = pq_table.to_pandas()
original_count = len(panda_table)
# Fails
table_output = pa.Table.from_pandas(panda_table)
del panda_table['payload']
# Works
table_output = pa.Table.from_pandas(panda_table)
# payload is the faulty column. Print out data
pq_table = pq.read_table("part-00000-8f03690f-736d-43a9-9287-6db9e228d59c.c000.gz.parquet")
panda_table = pq_table.to_pandas()
original_count = len(panda_table)
# Fails again when only the payload column is converted
table_output = pa.Table.from_pandas(panda_table[['payload']])
panda_table[['payload']]
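
One possible workaround, sketched under the assumption that the payload column holds Python dict objects (as the error message suggests), is to serialize those values to JSON strings before the conversion. The helper name to_json_if_dict and the sample values below are hypothetical and are not taken from the attached file.

```python
import json


def to_json_if_dict(value):
    """Return a JSON string for dict values; pass anything else through unchanged."""
    if isinstance(value, dict):
        # sort_keys keeps the serialized output deterministic
        return json.dumps(value, sort_keys=True)
    return value


# Hypothetical values standing in for entries of the 'payload' column
sample = [{"b": 2, "a": 1}, None, "already a string"]
converted = [to_json_if_dict(v) for v in sample]
print(converted)
```

With pandas, the helper could be applied as panda_table['payload'] = panda_table['payload'].map(to_json_if_dict) before calling pa.Table.from_pandas(). Whether this is enough for the attached file has not been verified here.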
> Error inferring Arrow type for Python object array. Got Python object of type
> dict but can only handle these types: string, bool, float, int, date, time,
> decimal, list, array
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-2814
> URL: https://issues.apache.org/jira/browse/ARROW-2814
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: rob
> Priority: Blocker
> Attachments:
> part-00000-8f03690f-736d-43a9-9287-6db9e228d59c.c000.gz.parquet
>
>
> Calling pa.Table.from_pandas() fails on data read from a Parquet file that
> has a JSON string in it. The file attached to this ticket reproduces the
> problem, and the code below triggers the error.
>
> ## Reproducible code
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> pd.options.display.max_colwidth = 10000
> pq_table = pq.read_table("part-00000-8f03690f-736d-43a9-9287-6db9e228d59c.c000.gz.parquet")
>
> panda_table = pq_table.to_pandas()
> original_count = len(panda_table)
> # Fails
> table_output = pa.Table.from_pandas(panda_table)
> del panda_table['payload']
> # Works
> table_output = pa.Table.from_pandas(panda_table)
> # payload is the faulty column. Print out data
> pq_table = pq.read_table("part-00000-8f03690f-736d-43a9-9287-6db9e228d59c.c000.gz.parquet")
>
> panda_table = pq_table.to_pandas()
> original_count = len(panda_table)
> table_output = pa.Table.from_pandas(panda_table[['payload']])
> panda_table[['payload']]
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)