[jira] [Commented] (ARROW-10140) [Python][C++] No data for map column of a parquet file created from pyarrow and pandas

Chen Ming (Jira) Thu, 01 Oct 2020 08:24:55 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-10140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205608#comment-17205608
 ]


Chen Ming commented on ARROW-10140:
-----------------------------------

[~jorisvandenbossche] Also thanks for telling that the latest master can read 
the parquet file back into Arrow. (We got the "Reading lists of structs from 
Parquet files not yet supported" error with PyArrow 1.0.1.)

> [Python][C++] No data for map column of a parquet file created from pyarrow 
> and pandas
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-10140
>                 URL: https://issues.apache.org/jira/browse/ARROW-10140
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: Chen Ming
>            Assignee: Micah Kornfield
>            Priority: Minor
>         Attachments: pyspark.snappy.parquet, test_map.parquet, test_map.py
>
>
> Hi,
> I'm having problems reading parquet files with 'map' data type created by 
> pyarrow.
> I followed 
> [https://stackoverflow.com/questions/63553715/pyarrow-data-types-for-columns-that-have-lists-of-dictionaries]
>  to convert a pandas DF to an arrow table, then call write_table to output a 
> parquet file:
> (We also referred to https://issues.apache.org/jira/browse/ARROW-9812)
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> print(f'PyArrow Version = {pa.__version__}')
> print(f'Pandas Version = {pd.__version__}')
> df = pd.DataFrame({
>          'col1': pd.Series([
>              [('id', 'something'), ('value2', 'else')],
>              [('id', 'something2'), ('value','else2')],
>          ]),
>          'col2': pd.Series(['foo', 'bar'])
>      })
> udt = pa.map_(pa.string(), pa.string())
> schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])
> table = pa.Table.from_pandas(df, schema)
> pq.write_table(table, './test_map.parquet')
> {code}
> The above code (attached as test_map.py) runs smoothly on my developing 
> computer:
> {code:java}
> PyArrow Version = 1.0.1
> Pandas Version = 1.1.2
> {code}
> And generated the test_map.parquet file (attached as test_map.parquet) 
> successfully.
> Then I use parquet-tools (1.11.1) to read the file, but get the following 
> output:
> {code:java}
> $ java -jar parquet-tools-1.11.1.jar head test_map.parquet
> col1:
> .key_value:
> .key_value:
> col2 = foo
> col1:
> .key_value:
> .key_value:
> col2 = bar
> {code}
> I also checked the schema of the parquet file:
> {code:java}
> java -jar parquet-tools-1.11.1.jar schema test_map.parquet
> message schema {
>   optional group col1 (MAP) {
>     repeated group key_value {
>       required binary key (STRING);
>       optional binary value (STRING);
>     }
>   }
>   optional binary col2 (STRING);
> }{code}
> Am I doing something wrong? 
> We need to output the data to parquet files, and query them later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (ARROW-10140) [Python][C++] No data for map column of a parquet file created from pyarrow and pandas

Reply via email to