lupko opened a new issue, #36540: URL: https://github.com/apache/arrow/issues/36540
### Describe the bug, including details regarding any error messages, version, and platform.

When a PyArrow `FlightClient` calls `do_exchange`, the call fails on the server, and the client holds onto the raised `FlightError`, it seems that not all of the memory used during `do_exchange` processing is returned.

---

Let me provide some more context to paint a better picture.

In our system, we have a Flight RPC service that can generate data using a Flight Descriptor containing a command. The command specifies the flight path of the source data, the flight path where the data should be sunk, and some payload for the processing.

The command processing may take a while: it is a long-running command, and the service represents it internally as a task that it queues up. Clients 'submit' these tasks via GetFlightInfo and keep polling for the result.

When the task runs, it needs to do some expensive computation, which we have to offload to a worker process via DoExchange. The worker process has its own Flight RPC server (bound to a Unix socket). The task starts a DoExchange with the worker process, then opens a stream of the source data and shovels it to the worker. At this point the worker does its work, which may fail. On failure, the main process captures the error and makes it the result of the task, so that when the client comes polling for the task result, it learns that the task failed with that particular error.

What we found while load testing error scenarios is that the RSS of the server (i.e. the main process) only ever goes up.

Note that there is no funny stuff on success. The worker produces data, and the main process does a DoPut for the sink flight path and streams the data where necessary. Memory usage in this case is as expected.

---

After a lot of head scratching, I think I have finally pinpointed this suspicious memory usage to the fact that the main process holds onto the exception that failed the DoExchange. If the code does not keep the exception, the memory usage is ok.
As soon as it holds onto it, the memory accumulates. The rate of accumulation seems to correlate with the size of the source data.

---

Here is a gist with the reproducer, cut down to the bare bones: https://gist.github.com/lupko/ce59226b99790af2474bfe459cb4e36d

The reproducer calls `malloc_trim` to release any unused memory that malloc has not yet returned to the system. You will be able to observe that the trim with accumulated errors does not impact process RSS at all. Only when the accumulated errors are dropped does the memory usage go down (significantly).

### Component(s)

FlightRPC, Python
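
To illustrate the kind of retention described above, here is a minimal pure-Python sketch that does not use PyArrow at all. It relies on a general CPython mechanism: an exception's `__traceback__` references the frames it passed through, and those frames keep their local variables alive. Whether this is what actually happens with `FlightError` is an assumption that the report does not confirm. `Payload`, `failing_exchange`, `run_and_keep` and `run_and_strip` are hypothetical names made up for this example.

```python
import gc
import weakref


class Payload:
    """Hypothetical stand-in for a large buffer held in a local variable."""

    def __init__(self, size: int) -> None:
        self.data = bytearray(size)


def failing_exchange(size: int, refs: list) -> None:
    # Hypothetical stand-in for the code path that runs the exchange and fails.
    payload = Payload(size)
    refs.append(weakref.ref(payload))  # lets the caller observe the lifetime
    raise RuntimeError("worker failed")


def run_and_keep(size: int, refs: list) -> BaseException:
    try:
        failing_exchange(size, refs)
    except RuntimeError as exc:
        # The traceback still references the frame that owns `payload`.
        return exc


def run_and_strip(size: int, refs: list) -> BaseException:
    try:
        failing_exchange(size, refs)
    except RuntimeError as exc:
        # Drop the traceback so the failed frame and its locals can be freed.
        return exc.with_traceback(None)


if __name__ == "__main__":
    refs: list = []
    kept = run_and_keep(10_000_000, refs)
    stripped = run_and_strip(10_000_000, refs)
    gc.collect()
    print("payload alive while exception with traceback is kept:",
          refs[0]() is not None)
    print("payload alive after traceback was stripped:",
          refs[1]() is not None)
    kept = None  # dropping the exception releases the frames it referenced
    gc.collect()
    print("payload alive after dropping the kept exception:",
          refs[0]() is not None)
```

If the task only needs to report the failure later, storing `str(exc)` or an exception whose traceback has been stripped is one way to avoid keeping the failed frames alive. This is a general workaround sketch, not a fix confirmed for this issue.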

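
For readers who do not open the gist, the measurement step (trim, then look at RSS) can be approximated as follows. This is a sketch under stated assumptions, not the reproducer's actual code: it assumes a Linux system with glibc (`malloc_trim` is a glibc extension) and reads RSS from `/proc/self/statm`. The helper names `trim_malloc` and `rss_bytes` are hypothetical, and both helpers fall back gracefully where those facilities are missing.

```python
import ctypes
import ctypes.util
import os
from typing import Optional


def trim_malloc() -> bool:
    """Ask glibc to return free heap memory to the OS.

    Returns False when malloc_trim is not available (e.g. non-glibc systems).
    """
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return False
    try:
        libc = ctypes.CDLL(libc_name)
        malloc_trim = libc.malloc_trim
    except (OSError, AttributeError):
        return False
    malloc_trim.argtypes = [ctypes.c_size_t]
    malloc_trim.restype = ctypes.c_int
    malloc_trim(0)  # 0 = keep no extra padding at the top of the heap
    return True


def rss_bytes() -> Optional[int]:
    """Resident set size of this process in bytes, or None if unknown."""
    try:
        with open("/proc/self/statm") as f:
            resident_pages = int(f.read().split()[1])
    except (OSError, ValueError, IndexError):
        return None
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


if __name__ == "__main__":
    before = rss_bytes()
    trimmed = trim_malloc()
    after = rss_bytes()
    print(f"trimmed={trimmed} rss_before={before} rss_after={after}")
```

Calling `trim_malloc()` before each `rss_bytes()` reading helps separate memory that is still referenced from memory that the allocator is merely caching, which is the distinction the report relies on.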