[jira] [Commented] (ARROW-9812) [Python] Map data types doesn't work from Arrow to Pandas and Parquet

Joris Van den Bossche (Jira) Mon, 24 Aug 2020 01:49:19 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183073#comment-17183073
 ]


Joris Van den Bossche commented on ARROW-9812:
----------------------------------------------

[~mayuropensource] thanks for the issue. As you correctly noted, there are two 
different issues at play:

bq. 1. Map data type doesn't work from Arrow -> Pandas.

As the error message indicates, this conversion is not yet implemented. 
I don't think anyone is actively working on this, and contributions in this 
area are certainly welcome. 

bq. 2. Map data type doesn't get written to or read from Arrow -> Parquet.

It does get written (as a list of structs, which is how the Map Type is 
represented, see 
https://github.com/apache/arrow/blob/3fb1356ed2e4de7b00decbba081369019b9598a7/format/Schema.fbs#L98-L125).
However, such a mixture of nested lists and structs is not yet supported on the 
read side. This is actively being worked on (see ARROW-1644), and hopefully 
this will work in the next Arrow version.

So you are not doing anything wrong, but the Map type is not yet very 
well supported in the Parquet and pandas conversions. Using a struct type, as 
in your last example, is typically better supported (simple structs are 
supported both in Parquet IO and in conversion to pandas).

> [Python] Map data types doesn't work from Arrow to Pandas and Parquet
> ---------------------------------------------------------------------
>
>                 Key: ARROW-9812
>                 URL: https://issues.apache.org/jira/browse/ARROW-9812
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Mayur Srivastava
>            Priority: Major
>
> Hi,
> I'm having problems using 'map' data type in Arrow/parquet/pandas.
> I'm able to convert a pandas data frame to Arrow with a map data type.
> But, Arrow to Pandas doesn't work.
> When I write Arrow to Parquet, it seems to work, but I'm not sure if the data 
> type is written correctly.
> When I read back Parquet to Arrow, it fails saying "reading list of structs" 
> is not supported. It seems that map is stored as list of structs.
> There are two problems here:
>  # Map data type doesn't work from Arrow -> Pandas.
>  # Map data type doesn't get written to or read from Arrow -> Parquet.
> Questions:
> 1. Am I doing something wrong? Is there a way to get these to work? 
> 2. If these are unsupported features, will this be fixed in a future version? 
> Do you have plans or an ETA?
> The following code example (followed by output) should demonstrate the issues:
> I'm using Arrow 1.0.0 and Pandas 1.0.5.
> Thanks!
> Mayur
> {code:java}
> $ cat arrowtest.py
> import pyarrow as pa
> import pandas as pd
> import pyarrow.parquet as pq
> import traceback as tb
> import io
> print(f'PyArrow Version = {pa.__version__}')
> print(f'Pandas Version = {pd.__version__}')
> df1 = pd.DataFrame({'a': [[('b', '2')]]})
> print(f'df1')
> print(f'{df1}')
> print(f'Pandas -> Arrow')
> try:
>     t1 = pa.Table.from_pandas(df1, schema=pa.schema([pa.field('a', 
> pa.map_(pa.string(), pa.string()))]))
>     print('PASSED')
>     print(t1)
> except:
>     print(f'FAILED')
>     tb.print_exc()
> print(f'Arrow -> Pandas')
> try:
>     t1.to_pandas()
>     print('PASSED')
> except:
>     print(f'FAILED')
>     tb.print_exc()
> print(f'Arrow -> Parquet')
> fh = io.BytesIO()
> try:
>     pq.write_table(t1, fh)
>     print('PASSED')
> except:
>     print('FAILED')
>     tb.print_exc()
>     
> print(f'Parquet -> Arrow')
> try:
>     t2 = pq.read_table(source=fh)
>     print('PASSED')
>     print(t2)
> except:
>     print('FAILED')
>     tb.print_exc()
> {code}
> {code:java}
> $ python3.6 arrowtest.py
> PyArrow Version = 1.0.0 
> Pandas Version = 1.0.5 
> df1
>           a
> 0  [(b, 2)]
>  
> Pandas -> Arrow 
> PASSED 
> pyarrow.Table 
> a: map<string, string>
>  child 0, entries: struct<key: string not null, value: string> not null
>  child 0, key: string not null
>  child 1, value: string 
>  
> Arrow -> Pandas 
> FAILED 
> Traceback (most recent call last):
>   File "arrowtest.py", line 26, in <module>
>     t1.to_pandas()
>   File "pyarrow/array.pxi", line 715, in pyarrow.lib._PandasConvertible.to_pandas
>   File "pyarrow/table.pxi", line 1565, in pyarrow.lib.Table._to_pandas
>   File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 779, in table_to_blockmanager
>     blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
>   File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 1115, in _table_to_blocks
>     list(extension_columns.keys()))
>   File "pyarrow/table.pxi", line 1028, in pyarrow.lib.table_to_blocks
>   File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of type map<string, string> is known.
>  
> Arrow -> Parquet 
> PASSED 
>  
> Parquet -> Arrow 
> FAILED 
> Traceback (most recent call last):
>   File "arrowtest.py", line 43, in <module>
>     t2 = pq.read_table(source=fh)
>   File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1586, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1474, in read
>     use_threads=use_threads
>   File "pyarrow/_dataset.pyx", line 399, in pyarrow._dataset.Dataset.to_table
>   File "pyarrow/_dataset.pyx", line 1994, in pyarrow._dataset.Scanner.to_table
>   File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: Reading lists of structs from Parquet files not yet supported: key_value: list<key_value: struct<key: string not null, value: string> not null> not null
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
