mapleFU commented on issue #35393: URL: https://github.com/apache/arrow/issues/35393#issuecomment-1536548630
I'm not sure, but I guess that's exactly the reason. I'm not familiar with profiling on the Python side. In C++ I could probably write a script and use jemalloc to dump how much memory each object occupies, but that could be troublesome.

Parquet defines its metadata in Thrift here: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1011-L1071 . It uses the Thrift compact protocol as its encoding. When you inspect the file you don't need `row_groups`, but it is still deserialized and becomes a huge object in memory, so going from 80MB to 288MB is plausible. Parsing also uses extra buffers, which causes a larger burst of memory usage.

Here you only want the `schema`, but the whole footer is deserialized and loaded into memory. I guess there is currently no convenient way to deserialize only the `schema`. So I suggest using a larger row-group size so that the file has fewer row groups, which may reduce the memory footprint. By the way, storing statistics for nearly 5000 columns is also heavy. You can first switch to fewer row groups and then check the RSS it occupies.
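
To make the row-group suggestion concrete, here is a rough back-of-the-envelope sketch. It assumes the footer holds one column-chunk metadata entry per column per row group, so the number of entries that must be deserialized grows as row groups times columns. The function name `column_chunk_entries` and the total row count are hypothetical and only for illustration; the column count is the "nearly 5000" figure from above.

```python
import math


def column_chunk_entries(total_rows: int, row_group_size: int, num_columns: int) -> int:
    """Estimate how many column-chunk metadata entries the footer holds.

    Assumption: one entry per (row group, column) pair.
    """
    if row_group_size <= 0:
        raise ValueError("row_group_size must be positive")
    num_row_groups = math.ceil(total_rows / row_group_size)
    return num_row_groups * num_columns


NUM_COLUMNS = 5000      # "nearly 5000 columns" from this thread
TOTAL_ROWS = 1_000_000  # hypothetical row count, not taken from the issue

for row_group_size in (10_000, 100_000, 1_000_000):
    entries = column_chunk_entries(TOTAL_ROWS, row_group_size, NUM_COLUMNS)
    print(f"row_group_size={row_group_size:>9}: {entries:>7} column-chunk entries")
```

In pyarrow the row-group size can be set with the `row_group_size` argument of `pyarrow.parquet.write_table`, and `pyarrow.parquet.read_metadata(path).num_row_groups` should show the resulting count. The actual memory saved depends on what each entry stores (statistics in particular), so this is only a way to compare settings, not a prediction of RSS.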
