Hugo-loio opened a new issue, #38552:
URL: https://github.com/apache/arrow/issues/38552
### Describe the bug, including details regarding any error messages, version, and platform.
OS - archlinux
Python version - 3.11.5
pyarrow version - 13.0.0
In my current project, I have to load Parquet files that are reasonably big, up to 800 MB so far. The tables have a lot of columns but not that many rows.
I have a lot of trouble loading these Parquet tables from disk, and the only working solution completely blows up the memory usage on my machine.
The following script reproduces the problem:
```python
import os
import gc

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

array_length = 2**20
num_arrays = 10
filename = "test.parquet"

# Only run this if the data file hasn't already been generated
if not os.path.isfile(filename):
    # Generate arrays and write them to a dataframe
    print("Generating the dataframe...")
    columns = np.arange(1, array_length + 1)
    data = pd.DataFrame(columns=columns)
    for i in range(1, num_arrays + 1):
        array = np.random.rand(array_length)
        data.loc[i] = array
    print("Done")

    # Convert the dataframe to Arrow and then save to disk as Parquet
    print("Saving to disk...")
    arrow_data = pa.Table.from_pandas(data)
    pq.write_table(arrow_data, filename)
    print("Done")

# Reading the data
try:
    print("Option 1:")
    pq.read_table(filename)
except OSError as e:
    print("Option 1 failed with the following error:\n", e)

print("Option 2:")
print("Memory usage is blowing up here...")
limit = 2**31 - 1  # Maximum value I can choose
pq.read_table(filename, thrift_string_size_limit=limit,
              thrift_container_size_limit=limit)
gc.collect()

print("Option 3:")
print("This option either takes too long or it just gets stuck...")
print("Notice how the memory usage is still high from option 2, "
      "even after calling the garbage collector.")
pd.read_parquet(filename, engine='fastparquet')
```
If you run the script you will notice that:
1) `pq.read_table()` fails with `OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit`.
2) Increasing the `thrift_..._size_limit` options to the maximum value works around the error, but it makes the memory usage blow up, and the garbage collector doesn't reclaim that memory after reading (see the footer-inspection sketch after this list).
3) Reading with pandas and the `fastparquet` engine doesn't work either; it seems to either get stuck or take far too long.
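For reference, here is a small sketch that, I believe, shows the Parquet footer metadata itself is what trips the limit. It assumes `pyarrow.parquet.ParquetFile` accepts the same thrift limit keywords as `read_table` (it does in pyarrow 13):

```python
import pyarrow.parquet as pq

limit = 2**31 - 1  # same maximum as above

# Only the footer metadata is parsed here; the raised limits are needed
# because the footer itself exceeds the default Thrift deserialization limits.
pf = pq.ParquetFile("test.parquet",
                    thrift_string_size_limit=limit,
                    thrift_container_size_limit=limit)
meta = pf.metadata
print("columns:", meta.num_columns)                # ~2**20 columns
print("rows:", meta.num_rows)                      # 10
print("footer size in bytes:", meta.serialized_size)
```

Parquet stores per-column-chunk metadata in the footer, so with 2**20 columns the footer becomes enormous even though the actual data is only about 80 MB of float64 values (10 × 2**20 × 8 bytes).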
Since the data itself doesn't take up that much memory, I don't think this should be happening. I'm a bit worried that bigger datasets in the future will give me even more problems.
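In case it helps anyone hitting the same wall, and assuming the orientation of the data is flexible (which may not be true for the real project), here is a hedged sketch of a layout that avoids the metadata blow-up by storing one column per array rather than one column per element. The file name and column names are just illustrative:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

array_length = 2**20
num_arrays = 10

# One column per array: the footer only has to describe num_arrays columns,
# so the default Thrift limits are more than enough.
table = pa.table({f"array_{i}": np.random.rand(array_length)
                  for i in range(num_arrays)})
pq.write_table(table, "test_transposed.parquet")

# Reads back with the default limits and modest memory usage.
restored = pq.read_table("test_transposed.parquet")
print(restored.num_columns, restored.num_rows)  # 10 1048576
```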
### Component(s)
Parquet, Python