mapleFU commented on issue #38552: URL: https://github.com/apache/arrow/issues/38552#issuecomment-1817785040
Sorry for the late reply.

> Increasing the thrift_..._size_limit options to the maximum value solves the problem but makes the memory usage blow up and the garbage collector doesn't collect that memory after reading.

I think this is because your metadata is too large, and it's separate from https://github.com/apache/arrow/issues/38245. Parquet metadata is a Thrift binary. Would you mind printing the fileMetadata and its size? On the data side, I suspect you have too many columns even though the data itself isn't very large, so deserializing the Parquet file ends up dominated by deserializing the metadata. CSV has no such metadata, so it's much easier to parse. Disabling statistics might help. @Hugo-loio
