Kari Schoonbee created ARROW-11257:
--------------------------------------

             Summary: PyArrow Table contains different data after writing and 
reloading from Parquet
                 Key: ARROW-11257
                 URL: https://issues.apache.org/jira/browse/ARROW-11257
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 2.0.0
            Reporter: Kari Schoonbee
         Attachments: anonymised.jsonl, pyarrow_parquet_issue.ipynb

* I'm loading a JSON Lines file into a table using 
{code:java}
pa.json.read_json{code}
It contains one column that is a nested dictionary.
 * I select a row by key and inspect its nested dictionary.
 * I write the table to a Parquet file 
 * I load the table again from the Parquet file 
 * I check the same key and the nested dictionary is not the same.

 

To reproduce:

 

Find the attached JSONLines file and Jupyter Notebook. 

The JSON file contains entries per customer with a generated `msisdn`, 
`scoring_request_id` and `scorecard_result` object. Each `scorecard_result` 
consists of a list of feature objects, all with the same value as the 
`msisdn`, and a score.
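
For anyone without the attachment at hand, a stand-in file with roughly this shape could be generated as sketched below. The top-level keys come from the description above; the keys inside `scorecard_result` (`features`, `name`, `value`, `score`), the placement of the score at the `scorecard_result` level, and the helper names are assumptions, so the real file may differ.

```python
import json


def make_record(msisdn, request_id, n_features=3, score=0.5):
    """Build one customer entry in the shape described in the issue.

    Keys inside "scorecard_result" are hypothetical; only the three
    top-level keys are taken from the issue description.
    """
    # Every feature carries the same value as the msisdn.
    features = [{"name": f"feature_{i}", "value": msisdn} for i in range(n_features)]
    return {
        "msisdn": msisdn,
        "scoring_request_id": request_id,
        "scorecard_result": {"features": features, "score": score},
    }


def write_jsonl(records, path):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
```

Because each feature value repeats the row's own `msisdn`, a row whose features show a different value after reloading is easy to spot.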

The notebook reads the file and demonstrates the issue.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
