timothydijamco commented on issue #45287: URL: https://github.com/apache/arrow/issues/45287#issuecomment-2607664288
Thanks for helping look into this

> > Yeah this is an extreme case just to show the repro. In practice the file has a couple thousand rows per file.
>
> How many row groups per file (or rows per row group)? It turns out much of the Parquet metadata consumption is in ColumnChunk entries. A Thrift-deserialized ColumnChunk is 640 bytes long, and there are O(CRF) ColumnChunks in your dataset, with C=number_columns, R=number_row_groups_per_file and F=number_files.

We typically use one row group per file.

For some additional background, one of the situations where we originally observed high memory usage is this:

* The dataset has ~3000 rows per row group (and per file) and 5000 columns
* The user is reading 3 columns

In that dataset I observed that the metadata region in one of the .parquet files is 1082066 bytes long, and since the metadata region is read in full, the reader ends up reading ~120 bytes of metadata per data value it actually returns (back-of-envelope numbers at the end of this comment) -- so some memory usage overhead would be expected. However, our main concern is that the memory usage doesn't appear to be constant -- it keeps increasing and isn't freed after the read is done.

> Hmm, this needs clarifying a bit then :) What do the memory usage numbers you posted represent? Is it peak memory usage? Is it memory usage after loading the dataset as an Arrow table? Is the dataset object still alive at that point?

I think it's peak memory usage after loading the data into an Arrow table. However, I'm not sure whether the dataset object is still alive at that point. I'll work on a C++ repro and share it here.
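To make those metadata numbers concrete, here is a quick back-of-envelope calculation. It only uses figures already mentioned above (the 640-byte deserialized ColumnChunk size, 5000 columns, one row group per file, ~3000 rows per file, 3 columns read); the per-file total is an estimate, not a measurement.

```python
# Estimated deserialized ColumnChunk footprint per file: C * R * 640 bytes
columns = 5000               # C
row_groups_per_file = 1      # R
columnchunk_bytes = 640      # size of one Thrift-deserialized ColumnChunk (figure quoted above)

per_file = columns * row_groups_per_file * columnchunk_bytes
print(per_file)              # 3_200_000 -> ~3.2 MB per file; multiply by F files for the dataset-wide total

# Metadata read per data value actually consumed (the ~120 bytes/value figure)
metadata_region_bytes = 1_082_066
rows_per_file = 3_000
columns_read = 3
print(metadata_region_bytes / (rows_per_file * columns_read))   # ~120 bytes per value
```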

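In the meantime, here is a rough sketch of the kind of measurement the repro would do, written against the Python dataset API purely for illustration (the path and column names below are placeholders, not the real dataset):

```python
import gc
import pyarrow as pa
import pyarrow.dataset as ds

def report(label):
    # Current and peak allocations from Arrow's default memory pool
    pool = pa.default_memory_pool()
    print(f"{label}: allocated={pa.total_allocated_bytes()} peak={pool.max_memory()}")

report("before read")

dataset = ds.dataset("path/to/dataset", format="parquet")        # placeholder path
table = dataset.to_table(columns=["col_a", "col_b", "col_c"])    # placeholder column names
report("after to_table")

# Drop all references and check whether Arrow's allocations are actually released
del table
del dataset
gc.collect()
report("after releasing table and dataset")
```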