cyb70289 commented on issue #14229:
URL: https://github.com/apache/arrow/issues/14229#issuecomment-1272223773

   Managed to reproduce this error with a dataset that has a single column 
containing a list of integers.
   
   - to generate the dataset
   ```python
   import numpy as np
   import pandas as pd
   
   # total rows < max(int32)
   n_rows = 108000000
   
   # dataframe has only one column containing a list of 200 integers
   # 200 * n_rows > max(int32)
   data = [np.zeros(200, dtype='int8')] * n_rows
   
   print('generating...')
   df = pd.DataFrame()
   # only one column
   df['a'] = data
   
   print('saving ...')
   df.to_parquet('/tmp/pq')
   print('done')
   ```
   
   - to load the dataset
   ```python
   import pandas as pd
   
   print('loading...')
   df = pd.read_parquet('/tmp/pq', use_threads=False)
   print('size = {}'.format(df.shape))
   ```
   
   Tested with `pyarrow-9.0.0` and `pandas-1.5`. Loading the dataset failed with 
`OSError: List index overflow.`
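
For reference, the sizes mentioned in the script comments can be checked with plain integer arithmetic. This is only a sketch of the numbers, not of what the reader does internally. The name `list_len` is mine, and tying the overflow to Arrow's 32-bit list offsets is my assumption rather than something confirmed here.

```python
# Sizes taken from the generation script above.
n_rows = 108_000_000   # number of rows in the dataframe
list_len = 200         # elements in each row's list (hypothetical name)

# Largest value representable in a signed 32-bit integer.
int32_max = 2**31 - 1

# Total number of child elements across all lists in the column.
total_values = n_rows * list_len

print('rows fit in int32:        ', n_rows <= int32_max)
print('total elements:           ', total_values)
print('total elements fit int32: ', total_values <= int32_max)
```

The row count alone stays below the int32 limit, while the total element count of the list column does not.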
   
   **NOTE**: loading the dataset triggers an out-of-memory kill on a machine 
with 128G RAM. I had to test it on a machine with 256G RAM.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
