[jira] [Commented] (ARROW-9812) [Python] Map data types doesn't work from Arrow to Parquet

Antoine Pitrou (Jira) Wed, 23 Jun 2021 07:15:05 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368235#comment-17368235
 ]


Antoine Pitrou commented on ARROW-9812:
---------------------------------------

The test script in the issue description works on git master. I'm going to 
close this issue.

> [Python] Map data types doesn't work from Arrow to Parquet
> ----------------------------------------------------------
>
>                 Key: ARROW-9812
>                 URL: https://issues.apache.org/jira/browse/ARROW-9812
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Mayur Srivastava
>            Priority: Major
>             Fix For: 5.0.0
>
>
> Hi,
> I'm having problems using 'map' data type in Arrow/parquet/pandas.
> I'm able to convert a pandas data frame to Arrow with a map data type.
> When I write Arrow to Parquet, it seems to work, but I'm not sure if the data 
> type is written correctly.
> When I read back Parquet to Arrow, it fails saying "reading list of structs" 
> is not supported. It seems that map is stored as list of structs.
> There are two problems here:
>  # -Map data type doesn't work from Arrow -> Pandas-. Fixed in ARROW-10151
>  # Map data type doesn't get written to or read from Arrow -> Parquet.
> Questions:
> 1. Am I doing something wrong? Is there a way to get these to work? 
> 2. If these are unsupported features, will this be fixed in a future version? 
> Do you have plans or an ETA?
> The following code example (followed by output) should demonstrate the issues:
> I'm using Arrow 1.0.0 and Pandas 1.0.5.
> Thanks!
> Mayur
> {code:java}
> $ cat arrowtest.py
> import pyarrow as pa
> import pandas as pd
> import pyarrow.parquet as pq
> import traceback as tb
> import io
> print(f'PyArrow Version = {pa.__version__}')
> print(f'Pandas Version = {pd.__version__}')
> df1 = pd.DataFrame({'a': [[('b', '2')]]})
> print(f'df1')
> print(f'{df1}')
> print(f'Pandas -> Arrow')
> try:
>     t1 = pa.Table.from_pandas(df1, schema=pa.schema([pa.field('a', 
> pa.map_(pa.string(), pa.string()))]))
>     print('PASSED')
>     print(t1)
> except:
>     print(f'FAILED')
>     tb.print_exc()
> print(f'Arrow -> Pandas')
> try:
>     t1.to_pandas()
>     print('PASSED')
> except:
>     print(f'FAILED')
>     tb.print_exc()
> print(f'Arrow -> Parquet')
> fh = io.BytesIO()
> try:
>     pq.write_table(t1, fh)
>     print('PASSED')
> except:
>     print('FAILED')
>     tb.print_exc()
>     
> print(f'Parquet -> Arrow')
> try:
>     t2 = pq.read_table(source=fh)
>     print('PASSED')
>     print(t2)
> except:
>     print('FAILED')
>     tb.print_exc()
> {code}
> {code:java}
> $ python3.6 arrowtest.py
> PyArrow Version = 1.0.0 
> Pandas Version = 1.0.5 
> df1
>           a
> 0  [(b, 2)]
>  
> Pandas -> Arrow 
> PASSED 
> pyarrow.Table 
> a: map<string, string>
>  child 0, entries: struct<key: string not null, value: string> not null
>  child 0, key: string not null
>  child 1, value: string 
>  
> Arrow -> Pandas 
> FAILED 
> Traceback (most recent call last):
>   File "arrowtest.py", line 26, in <module>
>     t1.to_pandas()
>   File "pyarrow/array.pxi", line 715, in pyarrow.lib._PandasConvertible.to_pandas
>   File "pyarrow/table.pxi", line 1565, in pyarrow.lib.Table._to_pandas
>   File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 779, in table_to_blockmanager
>     blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
>   File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 1115, in _table_to_blocks
>     list(extension_columns.keys()))
>   File "pyarrow/table.pxi", line 1028, in pyarrow.lib.table_to_blocks
>   File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of type map<string, string> is known.
>  
> Arrow -> Parquet 
> PASSED 
>  
> Parquet -> Arrow 
> FAILED 
> Traceback (most recent call last):
>   File "arrowtest.py", line 43, in <module>
>     t2 = pq.read_table(source=fh)
>   File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1586, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1474, in read
>     use_threads=use_threads
>   File "pyarrow/_dataset.pyx", line 399, in pyarrow._dataset.Dataset.to_table
>   File "pyarrow/_dataset.pyx", line 1994, in pyarrow._dataset.Scanner.to_table
>   File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: Reading lists of structs from Parquet files not yet supported: key_value: list<key_value: struct<key: string not null, value: string> not null> not null
> {code}
> Updated to indicate that the conversion to Pandas is done, but the Parquet 
> round trip is not yet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
