lupko opened a new issue, #36540:
URL: https://github.com/apache/arrow/issues/36540

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   When a PyArrow FlightClient calls `do_exchange`, the call fails on the 
server, and the client holds onto the raised `FlightError`, it seems that not 
all memory used during `do_exchange` processing is released.
   
   ---
   
   Let me provide some more context to paint a better picture. In our system, 
we have a Flight RPC service that generates data from a Flight Descriptor 
containing a command. The command specifies the flight path of the source data, 
the flight path where the data should be sunk, and some payload for the processing.
   
   The command processing may take a while. It is a long-running command, and 
the service represents it internally as a task that it queues up. The clients 
'submit' these tasks via GetFlightInfo and keep polling for the result.
   
   When the task runs, it needs to do some expensive computation, which we 
offload to a worker process via DoExchange. The worker process has 
its own Flight RPC server (bound to a Unix socket).
   
   The task starts a DoExchange with the worker process, then opens a stream of 
the source data and shovels it to the worker. At this point, the worker does its 
work, which may fail. On failure, the main process captures the error and 
makes it the result of the task, so that when a client comes polling for the task 
result, it learns that the task failed with that particular error.
   
   What we found during load testing of error scenarios is that the RSS of the 
server (i.e. the main process) only ever goes up.
   
   Note that there is no funny stuff on success. The worker produces data, the 
main process does DoPut for the sink flight path and streams the data where 
necessary. Memory usage in this case is as expected.
   
   ---
   
   After a lot of head scratching, I think I have finally pinpointed this 
suspicious memory usage to the fact that the main process holds onto the 
exception that failed the DoExchange. If the code does not keep the exception, 
memory usage is fine. As soon as it holds onto it, memory accumulates. 
The rate of accumulation seems to correlate with the size of the source data.
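
For illustration only, here is a minimal pure-Python sketch of one general mechanism that could produce this pattern. It is not confirmed to be what happens inside PyArrow. A retained exception keeps its `__traceback__`, which keeps the frames that were active when it was raised, and therefore their local variables. `Payload`, `failing_exchange`, and the two `run_and_keep_*` helpers are hypothetical stand-ins; they are not part of the reproducer or of PyArrow.

```python
import gc
import weakref


class Payload:
    """Hypothetical stand-in for a large buffer, e.g. a chunk of source data."""

    def __init__(self, size: int) -> None:
        self.data = bytearray(size)


def failing_exchange(size: int, refs: list) -> None:
    """Simulate a call that holds data in a local variable and then fails."""
    payload = Payload(size)
    refs.append(weakref.ref(payload))  # lets the caller observe when payload is freed
    raise RuntimeError("worker failed")


def run_and_keep_exception(size: int, refs: list) -> BaseException:
    """Return the exception object itself, traceback included."""
    try:
        failing_exchange(size, refs)
    except RuntimeError as exc:
        # exc.__traceback__ still references the frame of failing_exchange,
        # and that frame still references its local `payload`.
        return exc


def run_and_keep_summary(size: int, refs: list) -> str:
    """Return only a plain-string description of the failure."""
    try:
        failing_exchange(size, refs)
    except RuntimeError as exc:
        return f"{type(exc).__name__}: {exc}"


def payload_alive(refs: list) -> bool:
    """Report whether the most recently created payload is still reachable."""
    gc.collect()
    return refs[-1]() is not None
```

If this mechanism is involved, storing only the error text, or calling `exc.with_traceback(None)` before storing the exception, should let the frames and their locals be freed. Whether that is enough for a `FlightError` is an open question here.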
   
   ---
   
   Here is a gist with a reproducer that I cut down to the bare bones: 
https://gist.github.com/lupko/ce59226b99790af2474bfe459cb4e36d
   
   The reproducer calls `malloc_trim` to release any free memory that malloc has 
not yet returned to the system. You will be able to observe that, while the 
accumulated errors are held, the trim does not affect process RSS at all. Only 
when the accumulated errors are dropped does the memory usage go down (significantly).
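
As a rough sketch of this kind of measurement (this is not the code from the gist), the helpers below read the process RSS and call `malloc_trim(0)` through `ctypes`. They assume Linux with glibc; on other platforms both functions return `None`. The names `current_rss_bytes` and `trim_malloc` are made up for this example.

```python
import ctypes
import os
from typing import Optional


def current_rss_bytes() -> Optional[int]:
    """Return the resident set size in bytes, or None if /proc is unavailable."""
    try:
        with open("/proc/self/statm") as f:
            # Second field of statm is the number of resident pages.
            resident_pages = int(f.read().split()[1])
        return resident_pages * os.sysconf("SC_PAGE_SIZE")
    except (OSError, ValueError, IndexError, AttributeError):
        return None


def trim_malloc() -> Optional[bool]:
    """Call glibc's malloc_trim(0).

    Returns True if memory was released to the system, False if not,
    and None if malloc_trim is not available in this process.
    """
    try:
        libc = ctypes.CDLL(None)  # symbols already loaded into this process
        trim = libc.malloc_trim
    except (OSError, TypeError, AttributeError):
        return None
    trim.argtypes = [ctypes.c_size_t]
    trim.restype = ctypes.c_int
    return trim(0) == 1


if __name__ == "__main__":
    before = current_rss_bytes()
    released = trim_malloc()
    after = current_rss_bytes()
    print(f"RSS before trim: {before}, after trim: {after}, released: {released}")
```

Comparing the RSS before and after the trim, once with the errors still held and once after dropping them, is the comparison described above.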
   
   
   
   ### Component(s)
   
   FlightRPC, Python

