Re: [I] [C++] Metadata related memory leak when reading parquet dataset [arrow]

via GitHub Wed, 22 Jan 2025 00:19:52 -0800


pitrou commented on issue #45287:
URL: https://github.com/apache/arrow/issues/45287#issuecomment-2606567048


   > I was expecting the metadata memory usage to be more like O(C) than
   > O(C * F), where C=number_columns and F=number_files. Once a parquet file
   > is loaded into a pyarrow Table, we don't need to keep the metadata around
   > (all files have the same schema), but perhaps I am misunderstanding how
   > read parquet works.
   
   Hmm, this needs clarifying a bit then :) What do the memory usage numbers
   you posted represent? Is it peak memory usage? Is it memory usage after
   loading the dataset as an Arrow table? Is the dataset object still alive at
   that point?
   
   > It would be great to reduce metadata memory usage when the files being
   > read all have the same schema, since this is a quite common case I think
   
   Definitely. 


