adrienchaton opened a new issue, #14229:
URL: https://github.com/apache/arrow/issues/14229
Hello,
I am storing pandas DataFrames as .parquet with pd.to_parquet and then trying to load them back with pd.read_parquet.
I am running into an error for which I cannot find a solution, and would kindly ask for help.
Here is the trace:
```
  File "/home/gnlzm/miniconda3/envs/antidoto/lib/python3.9/site-packages/pandas/io/parquet.py", line 493, in read_parquet
    return impl.read(
  File "/home/gnlzm/miniconda3/envs/antidoto/lib/python3.9/site-packages/pandas/io/parquet.py", line 240, in read
    result = self.api.parquet.read_table(
  File "/home/gnlzm/miniconda3/envs/antidoto/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2827, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/home/gnlzm/miniconda3/envs/antidoto/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2473, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: List index overflow.
```
If I store a small DataFrame, I do not get this error.
If I store a larger DataFrame, e.g. with 295,912,999 rows, then I get this error.
However, before saving it I print the index range, and it is bounded between 0 and 295912998.
Saving the .parquet with index=True or index=False gives the same error, and I do not understand why there is an overflow on a bounded index.
Any hints are much appreciated, thanks!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]