eadwright commented on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-843866914


   Not a unit test yet, but I have at least found a reliable way to create a 
Parquet file in Python which causes Java to break when reading it. For this we 
require a huge number of rows, and a string column with very high entropy to 
avoid the use of dictionary lookup or compression. Note that these conditions 
can occur in the real world - records containing long SHA3 hashes, for example.
   
   This python code needs more than 16GB of RAM to execute.
   
   ```
   import pandas as pd
   import numpy as np
   
   # 75 million rows of random floats in four columns
   rand_array = np.random.rand(75000000, 4)
   
   df = pd.DataFrame(rand_array, columns=['number1', 'number2', 'number3', 'number4'])
   # Concatenate three float strings into one long, high-entropy string per row
   df['string1'] = (df['number1'].astype(str) + df['number2'].astype(str)
                    + df['number3'].astype(str))
   
   df.to_parquet("random.parquet", compression='snappy', engine='pyarrow')
   ```
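
For a sense of scale without allocating 16GB, the size of the string column can be estimated from a small sample. This is only a rough sketch using the standard library: `estimate_string_column_bytes` and `JAVA_INT_MAX` are hypothetical names, Python's `str(float)` is assumed to produce strings of about the same length as pandas' `astype(str)`, and comparing against the signed 32-bit limit is an assumption about why the size matters, not something measured on the actual file.

```python
import random

# Largest value of a signed 32-bit Java int (assumed to be the relevant limit).
JAVA_INT_MAX = 2**31 - 1
# Row count taken from the script above.
ROWS = 75_000_000


def estimate_string_column_bytes(rows, sample_size=10_000, seed=0):
    """Estimate the raw byte size of a column of three concatenated float strings."""
    rng = random.Random(seed)
    total_chars = 0
    for _ in range(sample_size):
        # Mirrors number1 + number2 + number3 after conversion to str.
        value = "".join(str(rng.random()) for _ in range(3))
        total_chars += len(value)
    average_length = total_chars / sample_size
    # The strings are ASCII, so one character is one byte.
    return int(rows * average_length)


estimated = estimate_string_column_bytes(ROWS)
print(f"estimated raw string data: {estimated / 1e9:.2f} GB")
print(f"exceeds signed 32-bit limit: {estimated > JAVA_INT_MAX}")
```

The estimate ignores Parquet's per-value length prefixes and any effect of snappy, so it should be read as an order of magnitude only.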


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

