pitrou commented on issue #45287: URL: https://github.com/apache/arrow/issues/45287#issuecomment-2605609435
I haven't tried to track down the precise source of the memory consumption (yet?), but some quick comments already:

> When running the code above with "time -v", it shows the memory usage is about 6G, which is significantly larger than the data loaded so I think there is some metadata related memory leak

A quick back-of-the-envelope calculation says that this is roughly 2 kB per column per file.

> I also noticed that the memory usage increases if I use longer column names, e.g., if I prepend a 128 char long prefix to the column names, the memory usage is about 11G.

Interesting data point. That would be 4 kB per column per file, so quite a bit of additional overhead just for 128 additional characters...

> Each partition has a single row, and 10k double columns.

I would stress that "a single row and 10k columns" is never going to be a good use case for Parquet, which is designed from the ground up as a columnar format. If you're storing fewer than e.g. 1k rows (regardless of the number of columns), the format will certainly impose a lot of overhead.

Of course, we can still try to find out if there's some low-hanging fruit that would allow reducing the memory usage of metadata.
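As a sanity check, the back-of-the-envelope figures above can be reproduced with a few lines of arithmetic. Note that the number of partition files (`n_files = 300`) is an assumption on my part; this excerpt does not state it, but that value is consistent with the quoted ~2 kB per column per file estimate.

```python
# Back-of-the-envelope estimate of per-column, per-file metadata overhead.
n_cols = 10_000   # double columns per partition file (from the issue)
n_files = 300     # ASSUMED file count, consistent with the ~2 kB figure

mem_short_names = 6e9   # ~6 GB reported with the original column names
mem_long_names = 11e9   # ~11 GB reported with a 128-char name prefix

def per_column_per_file(total_bytes):
    """Spread the total memory usage evenly over every (column, file) pair."""
    return total_bytes / (n_cols * n_files)

print(f"short names: {per_column_per_file(mem_short_names):.0f} B/col/file")
print(f"long names:  {per_column_per_file(mem_long_names):.0f} B/col/file")
```

Under these assumptions the script prints roughly 2000 bytes per column per file for the short-name run and roughly 3700 (≈4 kB) for the long-name run, matching the estimates quoted above.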
