[jira] [Comment Edited] (ARROW-11257) [C++][Parquet] PyArrow Table contains different data after writing and reloading from Parquet

Kari Schoonbee (Jira) Tue, 19 Jan 2021 08:08:09 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-11257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17267998#comment-17267998
 ]


Kari Schoonbee edited comment on ARROW-11257 at 1/19/21, 4:07 PM:
------------------------------------------------------------------

Thanks Joris, that is great. I'll keep an eye out.

I can also add that the same Parquet round-trip works with `pyspark==3.0.0`, 
writing via `data_frame.write.parquet()`.


was (Author: kari_s):
Thanks Joris, that is great. I'll keep an eye out.

I can also add that doing the same parquet round-trip using `pyspark==3.0.0` 
works.

> [C++][Parquet] PyArrow Table contains different data after writing and 
> reloading from Parquet
> ---------------------------------------------------------------------------------------------
>
>                 Key: ARROW-11257
>                 URL: https://issues.apache.org/jira/browse/ARROW-11257
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: Kari Schoonbee
>            Priority: Critical
>         Attachments: anonymised.jsonl, pyarrow_parquet_issue.ipynb
>
>
>  * I'm loading a JSONLines file into a table using 
> {code:java}
> pa.json.read_json{code}
> It contains one column that is a nested dictionary.
>  * I select a row by key and inspect its nested dictionary.
>  * I write the table to parquet 
>  * I load the table again from the parquet file 
>  * I check the same key and the nested dictionary is not the same.
>  
> To reproduce:
>  
> Find the attached JSONLines file and Jupyter Notebook. 
> The JSON file contains one entry per customer with a generated `msisdn`, 
> `scoring_request_id` and `scorecard_result` object. Each `scorecard_result` 
> consists of a list of feature objects, all with the same value as the 
> `msisdn`, and a score.
> The notebook reads the file and demonstrates the issue.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
