Hugo-loio opened a new issue, #38552:
URL: https://github.com/apache/arrow/issues/38552

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   OS - archlinux
   Python version - 3.11.5
   pyarrow version - 13.0.0
   
   In my current project I have to load Parquet files that are reasonably large, up to 800 MB so far. The tables have a very large number of columns but relatively few rows.
   Loading these Parquet tables from disk has been a lot of trouble, and the only working solution I have found completely blows up the memory usage on my machine.
   The following script reproduces the problem:
   ```python
   import numpy as np
   import pandas as pd
   import pyarrow as pa
   import pyarrow.parquet as pq
   import os
   import gc
   
   array_length = 2**20
   num_arrays = 10
   filename = "test.parquet"
   
   # Only run this if the data file hasn't been already generated
   if not os.path.isfile(filename):
   
       # Generate arrays and write them to a dataframe
       print("Generating the dataframe...")
       columns = np.arange(1, array_length + 1)
       data = pd.DataFrame(columns=columns)
       for i in range(1, num_arrays + 1):
           array = np.random.rand(array_length)
           data.loc[i] = array
       print("Done")
   
       # Convert the dataframe to arrow and then save to disk as parquet
       print("Saving to disk...")
       arrow_data = pa.Table.from_pandas(data)
       pq.write_table(arrow_data, filename)
       print("Done")
   
   # Reading the data
   try:
       print("Option 1:")
       pq.read_table(filename)
   except OSError as e:
       print("Option 1 failed with the following error:\n", e)
   
       print("Option 2:")
       print("Memory usage is blowing up here...")
       limit = 2**31 - 1  # Maximum value I can choose
       pq.read_table(filename, thrift_string_size_limit=limit, thrift_container_size_limit=limit)
       gc.collect()
   
       print("Option 3:")
       print("This option either takes too long or it just gets stuck...")
       print("Notice how the memory usage is still high from option 2, even 
after calling the garbage collector.")
       pd.read_parquet(filename, engine = 'fastparquet')
   
   ```
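
   I think the root cause is the footer: my understanding (an assumption on my part, I haven't checked the Parquet internals) is that the file metadata stores a schema entry and per-column-chunk metadata for every one of the 2**20 columns, so the Thrift-serialized footer becomes enormous even though the raw float data is only about 80 MB. The sketch below, which I would expect to work on pyarrow 13.0.0, opens just the file metadata and still needs the raised limits, because the whole footer is deserialized up front:
   ```python
   import pyarrow.parquet as pq

   limit = 2**31 - 1  # same maximum value used in the script above

   # Open only the footer metadata; the raised Thrift limits are still
   # required because the entire footer is parsed before any data is read.
   pf = pq.ParquetFile(
       "test.parquet",
       thrift_string_size_limit=limit,
       thrift_container_size_limit=limit,
   )
   meta = pf.metadata
   print(meta.num_columns, "columns,", meta.num_rows, "rows")
   print("serialized footer size:", meta.serialized_size, "bytes")
   ```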
   
   If you run the script you will notice that:
   1) `pq.read_table()` fails with `OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit`.
   2) Increasing the `thrift_..._size_limit` options to their maximum value gets past the error, but it makes the memory usage blow up, and that memory is not released after reading even when the garbage collector is called (see the memory-inspection sketch after this list).
   3) Reading with pandas and the `fastparquet` engine doesn't work either; it seems to either get stuck or take far too long.
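
   Regarding point 2, to separate memory that Arrow's allocator is still holding from a genuine leak, a check along these lines might help (just a sketch using the same file and limits as above; I'm not certain it cleanly distinguishes the two cases):
   ```python
   import gc
   import pyarrow as pa
   import pyarrow.parquet as pq

   limit = 2**31 - 1

   table = pq.read_table(
       "test.parquet",
       thrift_string_size_limit=limit,
       thrift_container_size_limit=limit,
   )
   print("allocated by Arrow after read:", pa.total_allocated_bytes())

   # Arrow buffers are released when the last reference to the table goes
   # away, not by the gc call itself; the allocator may still keep the
   # freed pages around, which the OS reports as used memory.
   del table
   gc.collect()
   print("allocated by Arrow after del: ", pa.total_allocated_bytes())
   print("memory pool backend:", pa.default_memory_pool().backend_name)
   ```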
   
   Since the data itself doesn't take up that much memory, I don't think this should be happening. I'm a bit worried that larger datasets in the future will cause even more problems.
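
   For what it's worth, a possible workaround (only a sketch, assuming the downstream code can be adapted to a transposed layout) would be to store each array as a column instead of a row, so the file has 10 columns and 2**20 rows and the footer only has to describe 10 columns:
   ```python
   import numpy as np
   import pyarrow as pa
   import pyarrow.parquet as pq

   array_length = 2**20
   num_arrays = 10

   # One column per array instead of one row per array: the Parquet footer
   # now only needs metadata for num_arrays columns instead of array_length.
   columns = {
       f"array_{i}": np.random.rand(array_length)
       for i in range(1, num_arrays + 1)
   }
   table = pa.table(columns)
   pq.write_table(table, "test_transposed.parquet")

   # This should read back without overriding any thrift_* limits.
   table = pq.read_table("test_transposed.parquet")
   print(table.num_rows, "rows x", table.num_columns, "columns")
   ```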
   
   ### Component(s)
   
   Parquet, Python

