[jira] [Commented] (ARROW-11344) [Python] Data of struct fields are out-of-order in parquet files created by the write_table() method

Chen Ming (Jira) Mon, 25 Jan 2021 19:40:05 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-11344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271792#comment-17271792
 ]


Chen Ming commented on ARROW-11344:
-----------------------------------

[~westonpace] Thank you for the information. I'm very happy to see that 3.0.0 
was released to PyPI this morning. In a quick test with the example data, 
the issue is fixed in PyArrow 3.0.0.

We want to do more testing with our production data, so I would like to keep 
this issue open for a few more days.
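
For anyone repeating this kind of check, here is a minimal sketch of a row-level 
consistency test. It assumes the rows have already been read back into Python 
dicts shaped like the parquet-tools output quoted in the issue description, and 
that, as in the sample data, the struct's "name" equals the "fruit_name" value 
plus ".csv". The helper find_misaligned_rows is hypothetical; it is not part of 
PyArrow or of the attached scripts.

```python
# Hypothetical helper, written only to illustrate the check.
def find_misaligned_rows(rows):
    """Return the indices of rows whose struct 'name' disagrees with 'fruit_name'.

    Assumption (taken from the sample data only): a correct row satisfies
    full_name.name == fruit_name + ".csv".
    """
    bad = []
    for i, row in enumerate(rows):
        expected = f"{row['fruit_name']}.csv"
        if row["full_name"]["name"] != expected:
            bad.append(i)
    return bad


# Two rows modelled on the quoted output: one correct, one misaligned.
rows = [
    {"full_name": {"package": "fruit.zip", "name": "strawberry.csv"},
     "fruit_name": "strawberry"},
    {"full_name": {"package": "fruit.zip", "name": "apple.csv"},
     "fruit_name": "strawberry"},
]
print(find_misaligned_rows(rows))
```

The rows can come from any reader, for example 
pq.read_table(path).to_pandas().to_dict("records"). An empty result would 
suggest the struct children are aligned with the other columns.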

> [Python] Data of struct fields are out-of-order in parquet files created by 
> the write_table() method
> ----------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-11344
>                 URL: https://issues.apache.org/jira/browse/ARROW-11344
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: Chen Ming
>            Priority: Major
>         Attachments: test_struct.csv, test_struct_200.parquet, 
> test_struct_200.py, test_struct_200_flat.parquet, test_struct_200_flat.py
>
>
> Hi,
> We recently found an out-of-order issue with the 'struct' data type and would 
> like to know if you can help find the root cause.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.read_csv('./test_struct.csv')
> print(df.dtypes)
> df['full_name'] = df.apply(
>     lambda x: {"package": x['file_package'], "name": x["file_name"]},
>     axis=1)
> my_df = df.drop(['file_package', 'file_name'], axis=1)
> file_fields = [('package', pa.string()), ('name', pa.string()),]
> my_schema = pa.schema([pa.field('full_name', pa.struct(file_fields)),
>                        pa.field('fruit_name', pa.string())])
> my_table = pa.Table.from_pandas(my_df, schema = my_schema)
> print('Table schema:')
> print(my_table.schema)
> pq.write_table(my_table, './test_struct_200.parquet')
> {code}
> The above code (attached as test_struct_200.py) was run with the following 
> Python package versions:
> {code:java}
> Pandas Version = 1.1.3
> PyArrow Version = 2.0.0
> {code}
> I then used parquet-tools (1.11.1) to read the file and got the following 
> incorrect output:
> {code:java}
> $ java -jar parquet-tools-1.11.1.jar head -n 2181 test_struct_200.parquet
> ...
> full_name:
> .package = fruit.zip
> .name = apple.csv
> fruit_name = strawberry
> full_name:
> .package = fruit.zip
> .name = apple.csv
> fruit_name = strawberry
> full_name:
> .package = fruit.zip
> .name = apple.csv
> fruit_name = strawberry
> {code}
> (BTW, you can also view the parquet file with 
> [http://parquet-viewer-online.com/])
> The expected output (see test_struct.csv) is:
> {code:java}
> $ java -jar parquet-tools-1.11.1.jar head -n 2181 test_struct_200.parquet
> ...
> full_name:
> .package = fruit.zip
> .name = strawberry.csv
> fruit_name = strawberry
> full_name:
> .package = fruit.zip
> .name = strawberry.csv
> fruit_name = strawberry
> full_name:
> .package = fruit.zip
> .name = strawberry.csv
> fruit_name = strawberry
> {code}
> For comparison, the following code (attached as test_struct_200_flat.py) 
> generates a parquet file whose data matches test_struct.csv:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.read_csv('./test_struct.csv')
> print(df.dtypes)
> my_schema = pa.schema([pa.field('file_package', pa.string()),
>                        pa.field('file_name', pa.string()),
>                        pa.field('fruit_name', pa.string())])
> my_table = pa.Table.from_pandas(df, schema = my_schema)
> print('Table schema:')
> print(my_table.schema)
> pq.write_table(my_table, './test_struct_200_flat.parquet')
> {code}
> I have also attached the two parquet files for your reference.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
