[ https://issues.apache.org/jira/browse/ARROW-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
rob updated ARROW-2814:
-----------------------
Description:
pa.Table.from_pandas() fails on a DataFrame that was read from a Parquet file
containing a JSON string column. I have attached the file that triggers the
problem to this ticket, and the code below reproduces the error.
h2. Reproducible code
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
pd.options.display.max_colwidth = 10000
pq_table = pq.read_table("part-00000-8f03690f-736d-43a9-9287-6db9e228d59c.c000.gz.parquet")
panda_table = pq_table.to_pandas()
orginal_count = len(panda_table)
h2. Fails
table_output = pa.Table.from_pandas(panda_table)
h2. Works once the payload column is dropped
del panda_table['payload']
table_output = pa.Table.from_pandas(panda_table)
h2. payload is the faulty column. Reload the file and print its data
pq_table = pq.read_table("part-00000-8f03690f-736d-43a9-9287-6db9e228d59c.c000.gz.parquet")
panda_table = pq_table.to_pandas()
orginal_count = len(panda_table)
table_output = pa.Table.from_pandas(panda_table[['payload']])
panda_table[['payload']]
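h2. Possible workaround (sketch)
The error in the issue title reports a Python object of type dict, so the payload column appears to hold dict objects rather than JSON text. Assuming that is the case, serializing each dict to a JSON string before conversion should give from_pandas() a type the error message lists as supported (string). The sketch below uses only the standard library. The helper names columns_with_dicts and dicts_to_json and the sample data are hypothetical; they are not part of pyarrow or of the attached file, and this has not been checked against 0.9.0.

```python
import json


def columns_with_dicts(columns):
    """Return the names of columns that contain at least one dict value.

    `columns` is anything with an .items() method yielding (name, values)
    pairs, for example a plain dict of lists.
    """
    return [
        name
        for name, values in columns.items()
        if any(isinstance(v, dict) for v in values)
    ]


def dicts_to_json(values):
    """Return a list in which every dict is replaced by its JSON text."""
    return [
        json.dumps(v, sort_keys=True) if isinstance(v, dict) else v
        for v in values
    ]


# Hypothetical stand-in for the columns of the attached file.
sample = {
    "id": [1, 2],
    "payload": [{"a": 1}, {"b": [2, 3]}],
}

# Find the offending columns, then replace their dicts with JSON strings.
bad = columns_with_dicts(sample)
for name in bad:
    sample[name] = dicts_to_json(sample[name])

print(bad)
print(sample["payload"])
```

Since pandas DataFrame.items() also yields (name, Series) pairs, columns_with_dicts(panda_table) should work on the DataFrame from the reproduction as well. Assigning panda_table['payload'] = dicts_to_json(panda_table['payload']) before calling pa.Table.from_pandas(panda_table) would then be the equivalent step there, at the cost of storing the payload as text instead of a nested type.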
was:
There is a problem when trying to run pa.Table.from_pandas() on a parquet file
that has a json string in it. I have attached the file to this ticket that is
the source of the problem and the code below will show the error.
h2. Reproducible code
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
pd.options.display.max_colwidth = 10000
pq_table = pq.read_table("part-00000-8f03690f-736d-43a9-9287-6db9e228d59c.c000.gz.parquet")
panda_table = pq_table.to_pandas()
orginal_count = len(panda_table)
h2. h2. Fails
table_output = pa.Table.from_pandas(panda_table)
del panda_table['payload']
h3. h2. Works
table_output = pa.Table.from_pandas(panda_table)
h3. h2. payload is the faulty column. Print out data
pq_table = pq.read_table("part-00000-8f03690f-736d-43a9-9287-6db9e228d59c.c000.gz.parquet")
panda_table = pq_table.to_pandas()
orginal_count = len(panda_table)
table_output = pa.Table.from_pandas(panda_table[['payload']])
panda_table[['payload']]
> Error inferring Arrow type for Python object array. Got Python object of type
> dict but can only handle these types: string, bool, float, int, date, time,
> decimal, list, array
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-2814
> URL: https://issues.apache.org/jira/browse/ARROW-2814
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: rob
> Priority: Blocker
> Attachments:
> part-00000-8f03690f-736d-43a9-9287-6db9e228d59c.c000.gz.parquet
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)