cyb70289 commented on issue #14229:
URL: https://github.com/apache/arrow/issues/14229#issuecomment-1272223773
Managed to reproduce this error with a dataset that has a single column
containing a list of integers.
- to generate the dataset
```python
import numpy as np
import pandas as pd
# total rows < max(int32)
n_rows = 108000000
# dataframe has only one column containing a list of 200 integers
# 200 * n_rows > max(int32)
data = [np.zeros(200, dtype='int8')] * n_rows
print('generating...')
df = pd.DataFrame()
# only one column
df['a'] = data
print('saving ...')
df.to_parquet('/tmp/pq')
print('done')
```
- to load the dataset
```python
import pandas as pd
print('loading...')
df = pd.read_parquet('/tmp/pq', use_threads=False)
print('size = {}'.format(df.shape))
```
Tested with `pyarrow-9.0.0` and `pandas-1.5`. Loading the dataset fails with
`OSError: List index overflow.`
**NOTE**: loading the dataset gets the process killed for running out of
memory on a machine with 128G RAM. I had to test it on a machine with 256G RAM.
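
For reference, a quick sketch of the arithmetic behind the comments in the generation script. It assumes (not confirmed here) that the failure is related to 32-bit list offsets, which is what Arrow's regular `list` type uses. The names `INT32_MAX`, `list_len`, `total_values` and `min_overflow_rows` are only for illustration.

```python
# Back-of-the-envelope check of the sizes used in the generation script.
INT32_MAX = 2**31 - 1  # 2,147,483,647

n_rows = 108_000_000   # same row count as in the generation script
list_len = 200         # integers per list

# Number of flattened child values across all rows.
total_values = n_rows * list_len

print(f"rows:         {n_rows:,} (fits in int32: {n_rows <= INT32_MAX})")
print(f"total values: {total_values:,} (fits in int32: {total_values <= INT32_MAX})")

# Smallest row count at which 200-element lists exceed the int32 range.
min_overflow_rows = INT32_MAX // list_len + 1
print(f"total values exceed int32 from {min_overflow_rows:,} rows")
```

The row count alone fits in int32, but the total number of list elements does not. I have not tested whether a dataset near the smaller row count printed last also triggers the error.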