eadwright commented on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-843866914
Not a unit test yet, but I have at least found a reliable way to create a
Parquet file in Python which causes Java to break when reading it. For this we
require a huge number of rows, and a string column with very high entropy to
avoid the use of dictionary lookup or compression. Note that these conditions
can occur in the real world: records containing long SHA3 hashes, for example.
This Python code needs more than 16GB of RAM to execute.
```
import numpy as np
import pandas as pd

# 75 million rows of random floats: high-entropy input for the string column.
rand_array = np.random.rand(75000000, 4)
df = pd.DataFrame(rand_array, columns=['number1', 'number2', 'number3', 'number4'])
# Concatenate three float strings into one long, effectively unique string per row.
df['string1'] = (df['number1'].astype(str) + df['number2'].astype(str)
                 + df['number3'].astype(str))
df.to_parquet("random.parquet", compression='snappy', engine='pyarrow')
```
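For a rough sense of how much raw string data that column holds, here is a minimal sketch that extrapolates from a small sample instead of building the full frame. It rests on assumptions: the standard library's `random.random()` stands in for `np.random.rand`, and the helper name `estimate_string_column_bytes` and the sample size are made up for illustration. It ignores Parquet's own per-value and per-page overhead, so treat the figure as a ballpark only.

```python
import random


def estimate_string_column_bytes(rows, sample_size=10_000, seed=0):
    """Extrapolate the raw size of 'string1' from a small random sample."""
    rng = random.Random(seed)
    total_chars = 0
    for _ in range(sample_size):
        # Each row concatenates the string forms of three random floats.
        total_chars += sum(len(str(rng.random())) for _ in range(3))
    avg_chars_per_row = total_chars / sample_size
    # The strings are ASCII, so one character is one byte.
    return int(avg_chars_per_row * rows)


estimate = estimate_string_column_bytes(75_000_000)
print(f"~{estimate / 1e9:.1f} GB of raw string data in 'string1'")
```

This runs in well under a second and needs almost no memory, so it can be used to pick a row count before committing to the full 16GB run.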