pitrou commented on issue #45287: URL: https://github.com/apache/arrow/issues/45287#issuecomment-2605609435
I haven't tried to track down the precise source of the memory consumption (yet?), but some quick comments already:

> When running the code above with "time -v", it shows the memory usage is about 6G, which is significantly larger than the data loaded so I think there is some metadata related memory leak

A quick back-of-the-envelope calculation says that this is roughly 2 kB per column per file.

> I also noticed that the memory usage increases if I use longer column names, e.g., if I prepend a 128 char long prefix to the column names, the memory usage is about 11G.

Interesting data point. That would be 4 kB per column per file, so quite a bit of additional overhead just for 128 additional characters...

> Each partition has a single row, and 10k double columns.

I would stress that "a single row and 10k columns" is never going to be a good use case for Parquet, which is designed from the ground up as a columnar format. If you're storing fewer than e.g. 1k rows (regardless of the number of columns), the format will certainly impose a lot of overhead.

Of course, we can still try to find out if there's some low-hanging fruit that would allow reducing the memory usage of metadata.
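As a sanity check, the back-of-the-envelope figures above can be reproduced with a few lines of arithmetic. Note that the number of partition files (`n_files = 300`) is an assumption on my part; this excerpt does not state it, but that value is consistent with the quoted ~2 kB per column per file estimate.

```python
# Back-of-the-envelope estimate of per-column, per-file metadata overhead.
n_cols = 10_000   # double columns per partition file (from the issue)
n_files = 300     # ASSUMED file count, consistent with the ~2 kB figure

mem_short_names = 6e9   # ~6 GB reported with the original column names
mem_long_names = 11e9   # ~11 GB reported with a 128-char name prefix

def per_column_per_file(total_bytes):
    """Spread the total memory usage evenly over every (column, file) pair."""
    return total_bytes / (n_cols * n_files)

print(f"short names: {per_column_per_file(mem_short_names):.0f} B/col/file")
print(f"long names:  {per_column_per_file(mem_long_names):.0f} B/col/file")
```

Under these assumptions the script prints roughly 2000 bytes per column per file for the short-name run and roughly 3700 (≈4 kB) for the long-name run, matching the estimates quoted above.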
