westonpace commented on issue #40224: URL: https://github.com/apache/arrow/issues/40224#issuecomment-2002958716
Thank you! I looked at this today. It's a bug. It was probably introduced when we switched the dataset writer over to using some more generic tools for backpressure (the async task scheduler). Apologies in advance for the long boring explanation :)

The dataset writer does not assume the underlying file writer is re-entrant (I can't remember if the parquet writer is re-entrant or not). When a batch comes in for file X and a write task is already running on file X, we queue that batch up. We call this data "in flight" and we have a special throttle for how many rows we can have in flight (it's not configurable and is set to 8Mi). When this throttle is full it sends a signal to the source to pause. All of this is actually working correctly.

The problem is that, when it pauses the source, a few extra tasks leak in because they were already running. This is kind of ok, but then it unpauses, fills up, and pauses again, and a few more extra tasks leak in. This process repeats...a lot. By the time it crashed on my machine there were over a thousand extra tasks. Because all these tasks are getting in, the source thinks the data is being consumed and it keeps reading. This becomes uncontrolled memory growth and everything crashes.

I suspect this is related to partitioning because you end up with lots of tiny writes and the in-flight throttle fills and empties A LOT during execution. You might be able to get things to pass if you set `min_rows_per_group` to some largish value, although then you run into a different kind of throttle (max rows "staged") and so it might not help.

The proper fix would be to not release the throttle until the extra tasks that snuck in the last time it was paused have been launched. I am doing some arrow-cpp work this week and will try and get a look at this.
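To make the pause/unpause cycle concrete, here is a toy model in Python. It is not Arrow code and does not reflect the real scheduler's data structures: the function `simulate`, its parameters, and the tiny capacity of 8 (standing in for the 8Mi row limit) are all hypothetical, and each task is assumed to hold exactly one unit of throttle capacity. It only shows why resuming the source as soon as any capacity frees up lets the backlog grow without bound, while waiting for the leaked tasks to launch keeps it small.

```python
def simulate(ticks, capacity=8, drain_per_tick=1, source_per_tick=2,
             leak_per_pause=3, wait_for_backlog=False):
    """Toy model: return the peak number of tasks queued behind the throttle."""
    in_flight = 0   # tasks currently holding throttle capacity
    backlog = 0     # tasks submitted while the throttle was full
    paused = False
    peak_backlog = 0

    for _ in range(ticks):
        # 1. The writer finishes some work and frees capacity.
        finished = min(drain_per_tick, in_flight)
        in_flight -= finished

        # 2. Queued tasks claim the freed capacity first.
        admitted = min(backlog, capacity - in_flight)
        backlog -= admitted
        in_flight += admitted

        # 3. Decide whether the source may resume.
        if paused:
            if wait_for_backlog:
                # Proposed behaviour: stay paused until leaked tasks launched.
                paused = backlog > 0
            else:
                # Assumed buggy behaviour: resume once any capacity was freed.
                paused = finished == 0

        # 4. A running source keeps submitting tasks.
        if not paused:
            for _ in range(source_per_tick):
                if in_flight < capacity:
                    in_flight += 1
                else:
                    backlog += 1
                    paused = True
            if paused:
                # Tasks that were already running leak in after the pause.
                backlog += leak_per_pause

        peak_backlog = max(peak_backlog, backlog)

    return peak_backlog


if __name__ == "__main__":
    print("release early:   ", simulate(1000, wait_for_backlog=False))
    print("wait for backlog:", simulate(1000, wait_for_backlog=True))
```

In this model the early-release variant gains a few queued tasks on every cycle, so its peak backlog grows with the number of ticks, whereas the variant that waits for the backlog to drain never holds more than one burst of submitted plus leaked tasks.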
