westonpace opened a new issue, #36951:
URL: https://github.com/apache/arrow/issues/36951

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   This issue was originally reported on the mailing list.
   
   The problem is that we have two throttles.
   
   There is the explicit, intentional throttle (max staged rows) and an implicit 
throttle in the throttled scheduler (we can only run one queued task at a time).  
I will attach a diagram in the comments.
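
   To make the two-throttle structure concrete, here is a minimal sketch 
(hypothetical names, not the real Arrow classes) of the implicit throttle: a 
scheduler that runs at most one queued task at a time and, crucially, accepts 
new submissions without any limit.

```cpp
// Minimal sketch of a capacity-1 throttle (hypothetical names, not the
// actual Arrow scheduler).  Submit() never rejects or blocks the caller:
// if the single slot is busy, the task just piles up in queue_.  That
// unbounded queue is the "blue" queue in the diagram.
#include <cstddef>
#include <deque>
#include <functional>

class SimpleThrottle {
 public:
  explicit SimpleThrottle(std::size_t capacity) : capacity_(capacity) {}

  // Start the task immediately if a slot is free, otherwise queue it.
  void Submit(std::function<void()> task) {
    if (running_ < capacity_) {
      ++running_;
      task();  // in the real scheduler this kicks off an async task
    } else {
      queue_.push_back(std::move(task));
    }
  }

  // Called when an async task finishes: free the slot, start the next task.
  void OnTaskFinished() {
    --running_;
    if (!queue_.empty() && running_ < capacity_) {
      ++running_;
      auto next = std::move(queue_.front());
      queue_.pop_front();
      next();
    }
  }

  std::size_t QueueSize() const { return queue_.size(); }

 private:
  std::size_t capacity_;
  std::size_t running_ = 0;
  std::deque<std::function<void()>> queue_;
};
```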
   
   Generally, I assumed the queueing stage would be very fast (faster than the 
disk writes at least, and faster than the sources too) and that the first 
(blue) queue would typically stay quite small.  In the reported case that 
assumption does not hold: there is a very fast read pipeline (it is just 
generating some data) running in parallel.
   
   It's possible that the queueing stage itself is the bottleneck (even slower 
than the writes) but I don't think this is the case.  What happens for me at 
least is that the second (red) queue fills up and backpressure is applied.
   
   So, as an example, assume that backpressure has been applied and we have 
100 blue tasks queued up and 100 red tasks queued up.
   
   Then, when a single write task completes, we grab one red task and send a 
resume signal.  That triggers a bunch of readers to start up again (let's 
assume there are 32); they quickly deliver more tasks and get paused again.  
Now we have something like...
   
   131 blue tasks queued up and 100 red tasks queued up
   
   Each time a red task completes, the cycle repeats.  Eventually the blue queue 
builds up (which is why memory keeps growing).  Also, every time we pull an 
item out of the blue queue we do a sort of "notify all" on all the waiting blue 
tasks to see which one can run next.  I don't actually see the reported 800% 
CPU when I run repr.py on my system but, based on the stack trace, this 
"notify all" processing is what is keeping those CPUs busy.
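
   As a rough illustration of why that wakeup pattern burns CPU (again with 
hypothetical names, not the real code), each completion effectively scans the 
whole wait list to find the next runnable task, so the work done per completed 
write grows with the size of the blue queue:

```cpp
// Sketch of the "notify all" pattern described above (hypothetical types).
// Every time capacity is released we walk the entire list of waiting tasks,
// asking each whether it can run now.  Only one task actually starts, but
// with N waiting tasks the scan is O(N), so once the blue queue holds
// millions of entries every write completion does a huge amount of work.
#include <functional>
#include <list>

struct WaitingTask {
  std::function<bool()> can_run;  // predicate re-checked on every wakeup
  std::function<void()> run;
};

class NotifyAllWaitList {
 public:
  void Add(WaitingTask task) { waiting_.push_back(std::move(task)); }

  // Called when a slot frees up; may scan every waiting task before
  // finding (at most) one that is allowed to run.
  void NotifyAll() {
    for (auto it = waiting_.begin(); it != waiting_.end(); ++it) {
      if (it->can_run()) {
        WaitingTask task = std::move(*it);
        waiting_.erase(it);
        task.run();
        return;
      }
    }
  }

 private:
  std::list<WaitingTask> waiting_;
};
```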
   
   I've confirmed, with OP's reproducible case (thank you!), that this blue 
queue ends up with millions of tasks, which is definitely wrong.
   
   We don't see this in unit tests because our unit tests just confirm that 
backpressure is applied when the writer can't keep up (and it is applied) and 
that the readers are asked to pause (they are).
   
   I don't see this when I run my end-to-end tests because my data source is 
parquet files (when I'm testing backpressure I'm usually testing repartitioning 
a large out-of-core dataset) and the data source is slow enough that the blue 
queue never really fills up.
   
   I think the easiest fix will be to add backpressure to the throttled 
scheduler.  We can probably even simplify the dataset writer a bit.  It might 
be interesting to investigate sometime what BPS the queueing stage can run at.  
If that stage ends up being slower than the disk then we might want to shrink 
that critical section.
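
   As a sketch of what backpressure in the throttled scheduler could look like 
(hypothetical names and watermarks, not a concrete design): when the 
scheduler's own queue gets too deep it asks the producer to pause, and it only 
sends the resume signal once that queue has actually drained.

```cpp
// Sketch of the proposed fix (hypothetical names): the same throttle as the
// earlier sketch, but the scheduler's own queue now has high/low watermarks
// and pauses the producer instead of accepting submissions without bound.
#include <cstddef>
#include <deque>
#include <functional>

class BackpressuredThrottle {
 public:
  BackpressuredThrottle(std::size_t capacity, std::size_t pause_above,
                        std::size_t resume_below,
                        std::function<void()> pause_producer,
                        std::function<void()> resume_producer)
      : capacity_(capacity),
        pause_above_(pause_above),
        resume_below_(resume_below),
        pause_producer_(std::move(pause_producer)),
        resume_producer_(std::move(resume_producer)) {}

  void Submit(std::function<void()> task) {
    if (running_ < capacity_) {
      ++running_;
      task();
      return;
    }
    queue_.push_back(std::move(task));
    if (!paused_ && queue_.size() >= pause_above_) {
      paused_ = true;
      pause_producer_();  // ask the readers to stop delivering batches
    }
  }

  void OnTaskFinished() {
    --running_;
    if (!queue_.empty() && running_ < capacity_) {
      ++running_;
      auto next = std::move(queue_.front());
      queue_.pop_front();
      next();
    }
    if (paused_ && queue_.size() <= resume_below_) {
      paused_ = false;
      resume_producer_();  // the scheduler's queue has drained; resume reads
    }
  }

 private:
  std::size_t capacity_;
  std::size_t pause_above_;
  std::size_t resume_below_;
  std::size_t running_ = 0;
  bool paused_ = false;
  std::function<void()> pause_producer_;
  std::function<void()> resume_producer_;
  std::deque<std::function<void()>> queue_;
};
```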
   
   ### Component(s)
   
   C++

