[jira] [Created] (ARROW-9812) Map data types doesn't work from Arrow to Pandas and Parquet

Mayur Srivastava (Jira) Thu, 20 Aug 2020 07:08:19 -0700

Mayur Srivastava created ARROW-9812:
---------------------------------------


             Summary: Map data types doesn't work from Arrow to Pandas and 
Parquet
                 Key: ARROW-9812
                 URL: https://issues.apache.org/jira/browse/ARROW-9812
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Mayur Srivastava


Hi,

I'm having problems using the 'map' data type across Arrow, Parquet, and pandas.

I'm able to convert a pandas DataFrame to an Arrow table with a map data type.

However, converting that Arrow table back to pandas fails.

When I write Arrow to Parquet, the write appears to succeed, but I'm not sure 
whether the map type is written correctly.

When I read the Parquet data back into Arrow, it fails with an error saying that 
reading lists of structs is not supported. It seems that a map is stored as a 
list of structs.
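
The "list of structs" wording matches the schema printed in the output, where the map column has an `entries` child of type struct<key, value>. As a rough illustration only, here is a plain-Python sketch of that logical layout. It assumes nothing about pyarrow internals, and `dict_to_entries` and `entries_to_dict` are hypothetical helper names, not pyarrow APIs:

```python
from typing import Dict, List, Optional, Tuple

# One map entry is a (key, value) pair. The key is required and the value may
# be None, mirroring struct<key: string not null, value: string>.
Entry = Tuple[str, Optional[str]]


def dict_to_entries(mapping: Dict[str, Optional[str]]) -> List[Entry]:
    """Flatten one map value into its list-of-(key, value) form."""
    return [(key, value) for key, value in mapping.items()]


def entries_to_dict(entries: List[Entry]) -> Dict[str, Optional[str]]:
    """Rebuild a dict from a list of (key, value) entries (last key wins)."""
    return dict(entries)


# A hypothetical one-row column, shaped like the 'a' column of df1 in the
# reproduction script: each row holds a list of key/value entries.
column = [dict_to_entries({"b": "2"})]
round_tripped = [entries_to_dict(row) for row in column]
```

Seen this way, a reader that cannot handle lists of structs also cannot handle maps, which would be consistent with the Parquet read error.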

There are two problems here:
 # The map data type can't be converted from Arrow -> Pandas.
 # The map data type can't be round-tripped through Parquet: writing Arrow -> Parquet appears to succeed, but reading Parquet -> Arrow fails.

Questions:

1. Am I doing something wrong? Is there a way to get these to work? 

2. If these are unsupported features, will they be supported in a future 
version? Do you have plans or an ETA?

The following code example (followed by output) should demonstrate the issues:

I'm using Arrow 1.0.0 and Pandas 1.0.5.

Thanks!

Mayur
{code:python}
$ cat arrowtest.py

import pyarrow as pa
import pandas as pd
import pyarrow.parquet as pq
import traceback as tb
import io

print(f'PyArrow Version = {pa.__version__}')
print(f'Pandas Version = {pd.__version__}')

df1 = pd.DataFrame({'a': [[('b', '2')]]})
print(f'df1')
print(f'{df1}')

print(f'Pandas -> Arrow')
try:
    t1 = pa.Table.from_pandas(df1, schema=pa.schema([pa.field('a', pa.map_(pa.string(), pa.string()))]))
    print('PASSED')
    print(t1)
except:
    print(f'FAILED')
    tb.print_exc()

print(f'Arrow -> Pandas')
try:
    t1.to_pandas()
    print('PASSED')
except:
    print(f'FAILED')
    tb.print_exc()
print(f'Arrow -> Parquet')

fh = io.BytesIO()
try:
    pq.write_table(t1, fh)
    print('PASSED')
except:
    print('FAILED')
    tb.print_exc()
    
print(f'Parquet -> Arrow')
try:
    t2 = pq.read_table(source=fh)
    print('PASSED')
    print(t2)
except:
    print('FAILED')
    tb.print_exc()
{code}
{code:java}
$ python3.6 arrowtest.py
PyArrow Version = 1.0.0 
Pandas Version = 1.0.5 
df1
          a
0  [(b, 2)]
 
Pandas -> Arrow 
PASSED 
pyarrow.Table 
a: map<string, string>
  child 0, entries: struct<key: string not null, value: string> not null
      child 0, key: string not null
      child 1, value: string
 
Arrow -> Pandas 
FAILED 
Traceback (most recent call last):
  File "arrowtest.py", line 26, in <module>
    t1.to_pandas()
  File "pyarrow/array.pxi", line 715, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1565, in pyarrow.lib.Table._to_pandas
  File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 779, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 1115, in _table_to_blocks
    list(extension_columns.keys()))
  File "pyarrow/table.pxi", line 1028, in pyarrow.lib.table_to_blocks
  File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of type map<string, string> is known.
 
Arrow -> Parquet 
PASSED 
 
Parquet -> Arrow 
FAILED 
Traceback (most recent call last):
  File "arrowtest.py", line 43, in <module>
    t2 = pq.read_table(source=fh)
  File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1586, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1474, in read
    use_threads=use_threads
  File "pyarrow/_dataset.pyx", line 399, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 1994, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Reading lists of structs from Parquet files not yet supported: key_value: list<key_value: struct<key: string not null, value: string> not null> not null
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
