[ https://issues.apache.org/jira/browse/ARROW-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

rob updated ARROW-2814:
-----------------------
    Description: 
pa.Table.from_pandas() fails on a DataFrame that was read from a Parquet file
containing a JSON string. The file that triggers the problem is attached to
this ticket, and the code below reproduces the error.

 

## Reproducible code

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

pd.options.display.max_colwidth = 10000

pq_table = pq.read_table("part-00000-8f03690f-736d-43a9-9287-6db9e228d59c.c000.gz.parquet")
panda_table = pq_table.to_pandas()
original_count = len(panda_table)

# Fails
table_output = pa.Table.from_pandas(panda_table)

del panda_table['payload']

# Works
table_output = pa.Table.from_pandas(panda_table)

# payload is the faulty column. Print out data
pq_table = pq.read_table("part-00000-8f03690f-736d-43a9-9287-6db9e228d59c.c000.gz.parquet")
panda_table = pq_table.to_pandas()
original_count = len(panda_table)

table_output = pa.Table.from_pandas(panda_table[['payload']])

panda_table[['payload']]

  was:
pa.Table.from_pandas() fails on a DataFrame that was read from a Parquet file
containing a JSON string. The file that triggers the problem is attached to
this ticket, and the code below reproduces the error.

# Reproducible code

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

pd.options.display.max_colwidth = 10000

pq_table = pq.read_table("part-00000-8f03690f-736d-43a9-9287-6db9e228d59c.c000.gz.parquet")
panda_table = pq_table.to_pandas()
original_count = len(panda_table)

# Fails

table_output = pa.Table.from_pandas(panda_table)

del panda_table['payload']

# Works

table_output = pa.Table.from_pandas(panda_table)

# payload is the faulty column. Print out data

pq_table = pq.read_table("part-00000-8f03690f-736d-43a9-9287-6db9e228d59c.c000.gz.parquet")
panda_table = pq_table.to_pandas()
original_count = len(panda_table)

table_output = pa.Table.from_pandas(panda_table[['payload']])

panda_table[['payload']]


> Error inferring Arrow type for Python object array. Got Python object of type 
> dict but can only handle these types: string, bool, float, int, date, time, 
> decimal, list, array
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-2814
>                 URL: https://issues.apache.org/jira/browse/ARROW-2814
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.9.0
>            Reporter: rob
>            Priority: Blocker
>         Attachments: part-00000-8f03690f-736d-43a9-9287-6db9e228d59c.c000.gz.parquet
>
>
> pa.Table.from_pandas() fails on a DataFrame that was read from a Parquet
> file containing a JSON string. The file that triggers the problem is
> attached to this ticket, and the code below reproduces the error.
>  
> ## Reproducible code
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> pd.options.display.max_colwidth = 10000
> pq_table = pq.read_table("part-00000-8f03690f-736d-43a9-9287-6db9e228d59c.c000.gz.parquet")
> panda_table = pq_table.to_pandas()
> original_count = len(panda_table)
> # Fails
> table_output = pa.Table.from_pandas(panda_table)
> del panda_table['payload']
> # Works
> table_output = pa.Table.from_pandas(panda_table)
> # payload is the faulty column. Print out data
> pq_table = pq.read_table("part-00000-8f03690f-736d-43a9-9287-6db9e228d59c.c000.gz.parquet")
> panda_table = pq_table.to_pandas()
> original_count = len(panda_table)
> table_output = pa.Table.from_pandas(panda_table[['payload']])
> panda_table[['payload']]
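
The error text says type inference hit a Python dict, so at least some values in the payload column are presumably dict objects rather than strings once the table has been through to_pandas(). Below is a minimal sketch of how one might check that and work around it by serializing the dicts to JSON strings first. The helper names value_type_counts and dicts_to_json and the sample values are hypothetical, made up for illustration; the real column contents are in the attached file and may differ.

```python
import json
from collections import Counter


def value_type_counts(values):
    """Count the Python type names found among a column's values."""
    return Counter(type(v).__name__ for v in values)


def dicts_to_json(values):
    """Return a new list with every dict replaced by its JSON string.

    Non-dict values (strings, None, numbers) pass through unchanged.
    """
    return [json.dumps(v, sort_keys=True) if isinstance(v, dict) else v
            for v in values]


# Hypothetical stand-in for the 'payload' column (assumption, not real data).
payload = [{"id": 1, "tags": ["a", "b"]}, '{"id": 2}', None]

print(value_type_counts(payload))   # shows which Python types are present
cleaned = dicts_to_json(payload)
print(value_type_counts(cleaned))   # no 'dict' entries remain
```

With the DataFrame from the report, the same helper could be applied as panda_table['payload'] = dicts_to_json(panda_table['payload']) before calling pa.Table.from_pandas(panda_table). This has not been checked against the attached file, and it turns the column into plain strings, which may not be what is wanted if the nested structure should be preserved.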



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
