westonpace opened a new issue, #36951: URL: https://github.com/apache/arrow/issues/36951
### Describe the bug, including details regarding any error messages, version, and platform.

This issue was originally reported on the mailing list. The problem is that we have two throttles: the explicit and intentional throttle (max staged rows) and an implicit throttle in the throttled scheduler (we can only run one queued task at a time). I will attach a diagram in the comments.

Generally I assumed the queueing stage would be very fast (faster than the disk writes at least, but I assumed it would be faster than the sources too) and so the first (blue) queue would typically be quite small. In this situation, that is not the case. There is a very fast read pipeline (just generating some data) that runs in parallel. It's possible that the queueing stage itself is the bottleneck (even slower than the writes) but I don't think this is the case.

What happens, for me at least, is that the second (red) queue fills up and backpressure is applied. So, as an example, assume that backpressure is applied and we have 100 blue tasks queued up and 100 red tasks queued up. Then, when a single write task completes, we grab one red task and send a resume signal. This triggers a bunch of readers to start up again (let's assume there are 32) and they quickly deliver more tasks and get paused again. Now we have something like 131 blue tasks queued up and 100 red tasks queued up. Each time a red task completes, the cycle repeats. Eventually the blue queue builds up (which is why memory keeps growing).

Also, every time we pull an item out of the blue queue we do a sort of "notify all" on all the waiting blue tasks to see which one can run next. I don't actually see the reported 800% CPU when I run repr.py on my system but, based on the stack trace, this processing of the "notify all" is what is keeping those CPUs busy.

I've confirmed, with the OP's reproducible case (thank you!), that this blue queue ends up with millions of tasks, which is definitely wrong. We don't see this in unit tests because our unit tests just confirm that backpressure is applied when the writer can't keep up (it is) and that the readers are asked to pause (they are). I don't see this when I run my end-to-end tests because my data source is parquet files (when I'm testing backpressure I'm usually testing repartitioning a large out-of-core dataset) and the data source is slow enough that the blue queue never really fills up.

I think the easiest fix will be to add backpressure to the throttled scheduler (a rough sketch of the idea is at the end of this issue). We can probably even simplify the dataset writer a bit. It might also be interesting to investigate at some point what BPS the queueing stage can run at. If that stage ends up being slower than the disk then we might want to shrink that critical section.

### Component(s)

C++
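For illustration only, here is a minimal standalone sketch of the proposed direction: a throttled scheduler that watches its own queue depth and emits pause/resume signals so producers stop before the (blue) queue can grow without bound. This is not Arrow's actual `ThrottledAsyncTaskScheduler` API; the class name `BackpressuredThrottledScheduler`, the watermark parameters, and the callbacks are all hypothetical.

```cpp
// Hypothetical sketch (not Arrow's real scheduler API): a throttled scheduler
// that applies its own backpressure via pause/resume callbacks once its queue
// crosses a high-water mark and drains below a low-water mark.
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

class BackpressuredThrottledScheduler {
 public:
  using Task = std::function<void()>;
  using Signal = std::function<void()>;

  BackpressuredThrottledScheduler(std::size_t high_water, std::size_t low_water,
                                  Signal on_pause, Signal on_resume)
      : high_water_(high_water),
        low_water_(low_water),
        on_pause_(std::move(on_pause)),
        on_resume_(std::move(on_resume)),
        worker_([this] { RunLoop(); }) {}

  ~BackpressuredThrottledScheduler() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      done_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }

  // Producers call this.  The scheduler never blocks them; it only asks them
  // to pause (via the callback) once its queue gets too deep.
  void Submit(Task task) {
    bool should_pause = false;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push(std::move(task));
      if (!paused_ && queue_.size() >= high_water_) {
        paused_ = true;
        should_pause = true;
      }
    }
    cv_.notify_one();
    if (should_pause) on_pause_();
  }

 private:
  void RunLoop() {
    for (;;) {
      Task task;
      bool should_resume = false;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
        if (done_ && queue_.empty()) return;
        task = std::move(queue_.front());
        queue_.pop();
        if (paused_ && queue_.size() <= low_water_) {
          paused_ = false;
          should_resume = true;
        }
      }
      if (should_resume) on_resume_();
      task();  // the throttle: a single worker runs one queued task at a time
    }
  }

  std::size_t high_water_;
  std::size_t low_water_;
  Signal on_pause_;
  Signal on_resume_;
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<Task> queue_;
  bool paused_ = false;
  bool done_ = false;
  std::thread worker_;
};

int main() {
  BackpressuredThrottledScheduler sched(
      /*high_water=*/8, /*low_water=*/2,
      /*on_pause=*/[] { std::cout << "pause producers\n"; },
      /*on_resume=*/[] { std::cout << "resume producers\n"; });
  for (int i = 0; i < 32; ++i) {
    sched.Submit([] { /* simulate a slow write */ });
  }
}
```

If the scheduler owns the pause/resume signalling like this, the dataset writer would no longer need to rely solely on the staged-rows limit downstream of the queue, which is what should keep the blue queue bounded and might also allow the dataset writer to be simplified as mentioned above.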
