Jokser commented on issue #13837:
URL: https://github.com/apache/arrow/issues/13837#issuecomment-1210511211

   Okay, it seems that the memory remaining in RSS is not directly related to Arrow.
   
   Let me describe my data flow:
   
   I have a generated tpcds dataset in Parquet format (with 10/100/1000GB scale 
factor).
   A tool writes parquet files into my Arrow Flight backed server opening up to 
128 connections (concurrent `doPut` calls).
   Each arrow::RecordBatch consumed from `doPut` stream goes to async 
transformation to some internal format. The lifetime of `arrow::RecordBatch` 
object is short.
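
   For context, the `doPut` handler is shaped roughly like this. This is a minimal sketch, not our actual code: the `IngestServer` class and the `ScheduleTransform` hook are made up for illustration, and it assumes an Arrow release where `FlightMessageReader::Next()` returns a `Result`:

   ```cpp
   #include <memory>

   #include <arrow/flight/server.h>
   #include <arrow/record_batch.h>
   #include <arrow/result.h>
   #include <arrow/status.h>

   class IngestServer : public arrow::flight::FlightServerBase {
    public:
     arrow::Status DoPut(
         const arrow::flight::ServerCallContext& context,
         std::unique_ptr<arrow::flight::FlightMessageReader> reader,
         std::unique_ptr<arrow::flight::FlightMetadataWriter> writer) override {
       while (true) {
         // Each chunk carries one RecordBatch from the client's put stream.
         ARROW_ASSIGN_OR_RAISE(arrow::flight::FlightStreamChunk chunk,
                               reader->Next());
         if (chunk.data == nullptr) break;  // end of stream
         // Hand the batch off to the async transformation; the shared_ptr is
         // dropped as soon as the transform has produced the internal format.
         ScheduleTransform(std::move(chunk.data));
       }
       return arrow::Status::OK();
     }

    private:
     // Hypothetical hook standing in for our internal async pipeline.
     void ScheduleTransform(std::shared_ptr<arrow::RecordBatch> batch) {
       // ... enqueue onto the transformation pipeline (omitted) ...
     }
   };
   ```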
   
   Here is what I see with the 1000 GB dataset:
   The peak memory usage is ~128 GB. After the tool finishes and all connections are closed, I still see ~128 GB in RSS.
   When I manually call `arrow::system_memory_pool()->ReleaseUnused();`, almost all of the memory is freed:
   
   ```
   [2022-08-10 13:36:22.402] [se_logger] [info] RSS before default pool 
release: 133422698496 bytes
   [2022-08-10 13:36:22.405] [se_logger] [info] RSS after default pool release: 
133424791552 bytes
   [2022-08-10 13:36:28.965] [se_logger] [info] RSS after system pool release: 
5903564800 bytes
   ```
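
   For reference, the RSS numbers above are produced roughly like this (a sketch, not our exact code; the `GetRSS` helper is an assumption and is Linux-specific):

   ```cpp
   #include <cstdint>
   #include <fstream>
   #include <iostream>

   #include <unistd.h>

   #include <arrow/memory_pool.h>

   // Resident set size in bytes, read from /proc/self/statm (Linux only).
   int64_t GetRSS() {
     long size_pages = 0, resident_pages = 0;
     std::ifstream statm("/proc/self/statm");
     statm >> size_pages >> resident_pages;  // second field is resident pages
     return static_cast<int64_t>(resident_pages) * sysconf(_SC_PAGESIZE);
   }

   int main() {
     std::cout << "RSS before default pool release: " << GetRSS() << " bytes\n";
     arrow::default_memory_pool()->ReleaseUnused();
     std::cout << "RSS after default pool release: " << GetRSS() << " bytes\n";
     arrow::system_memory_pool()->ReleaseUnused();
     std::cout << "RSS after system pool release: " << GetRSS() << " bytes\n";
   }
   ```

   As the log shows, releasing the default pool barely changes RSS; it is the system pool release that gives back the ~127 GB.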
   
   I build gRPC/Protobuf and Arrow from source.
   Here is a snippet showing how we build them:
   https://gist.github.com/Jokser/268a82428ceb00144519825029a469d7

