mapleFU commented on issue #35393:
URL: https://github.com/apache/arrow/issues/35393#issuecomment-1536548630

   I'm not sure, but I guess that's exactly the reason. I'm not familiar with 
profiling on the Python side. In C++ I could write a script and use 
jemalloc to dump how much memory each object occupies, but that could be 
troublesome.
   
   Parquet uses Thrift for the file metadata here: 
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1011-L1071
 . It is encoded with the Thrift compact protocol.
   
   When inspecting the file, you don't need `row_groups`, but it will still be 
deserialized and become a huge object in memory. Going from 80MB to 288MB is 
possible. During parsing, extra buffers are used, which causes an even larger 
burst in memory.
   
   Here you only want the `schema`, but the whole footer will be 
deserialized and loaded into memory. I guess there is currently no convenient 
way to deserialize only the `schema`. So I suggest using a larger 
row-group size, so that you have fewer row groups, which may lead to a 
smaller memory footprint.
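
   To make the effect of the row-group size concrete: the footer holds one 
column-chunk metadata entry per column per row group, so the number of entries 
to deserialize is roughly `num_row_groups * num_columns`. The sketch below is 
plain arithmetic; the row count and row-group sizes are made-up values for 
illustration, not numbers from this issue, and `footer_column_chunks` is a 
hypothetical helper.

```python
def footer_column_chunks(total_rows, row_group_size, num_columns):
    """Return (num_row_groups, column-chunk metadata entries in the footer)."""
    # Ceiling division: the last row group may be only partially filled.
    num_row_groups = -(-total_rows // row_group_size)
    return num_row_groups, num_row_groups * num_columns


# Hypothetical file: 10 million rows, 5000 columns.
small_groups = footer_column_chunks(10_000_000, 100_000, 5000)
large_groups = footer_column_chunks(10_000_000, 1_000_000, 5000)

print(small_groups)  # (row groups, entries) with 100k-row groups
print(large_groups)  # (row groups, entries) with 1M-row groups
```

   With these assumed numbers, a 10x larger row-group size gives 10x fewer 
entries in the footer.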
   
   By the way, storing statistics for nearly 5000 columns is also heavy. You 
can first switch to fewer row groups, and then check the RSS it occupies.

