westonpace commented on issue #40224:
URL: https://github.com/apache/arrow/issues/40224#issuecomment-2002958716

   Thank you!  I looked at this today.  It's a bug.  It was probably introduced 
when we switched the dataset writer over to using some more generic tools for 
backpressure (the async task scheduler).  Apologies in advance for the long 
boring explanation :)
   
   The dataset writer does not assume the underlying file writer is reentrant 
(I can't remember whether the Parquet writer is reentrant or not).  When a batch 
comes in for file X and a write task is already running on file X, we queue 
that batch up.  We call this data "in flight" and we have a special throttle 
for how many rows we can have in flight (it's not configurable and is set to 
8Mi).  When this throttle is full it sends a signal to the source to pause.
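   
   To make the queueing concrete, here is a minimal toy model of the behavior 
described above.  It is not the Arrow C++ code: the class and method names 
(`InFlightThrottle`, `FileQueue`, `push`, `finish_write`) are made up for 
illustration, and it only assumes what the paragraph says (one write task per 
file at a time, queued rows count as "in flight", a full throttle pauses the 
source).

```python
from collections import deque

MAX_ROWS_IN_FLIGHT = 8 * 1024 * 1024  # the 8Mi limit mentioned above


class InFlightThrottle:
    """Hypothetical throttle: tracks queued rows and when to pause the source."""

    def __init__(self, max_rows=MAX_ROWS_IN_FLIGHT):
        self.max_rows = max_rows
        self.rows_in_flight = 0
        self.source_paused = False

    def acquire(self, num_rows):
        self.rows_in_flight += num_rows
        if self.rows_in_flight >= self.max_rows:
            self.source_paused = True  # stands in for the "pause" signal

    def release(self, num_rows):
        self.rows_in_flight -= num_rows
        if self.rows_in_flight < self.max_rows:
            self.source_paused = False  # stands in for the "resume" signal


class FileQueue:
    """Hypothetical per-file queue: the file writer is never re-entered."""

    def __init__(self, throttle):
        self.throttle = throttle
        self.pending = deque()  # row counts of batches waiting for this file
        self.writing = False

    def push(self, num_rows):
        if self.writing:
            # A write is already running on this file: queue the batch.
            self.pending.append(num_rows)
            self.throttle.acquire(num_rows)
        else:
            # The file is idle: the write starts right away, nothing is queued.
            self.writing = True

    def finish_write(self):
        if self.pending:
            # The next queued batch starts writing and leaves the throttle.
            self.throttle.release(self.pending.popleft())
        else:
            self.writing = False


throttle = InFlightThrottle(max_rows=100)  # tiny limit for the demo
file_x = FileQueue(throttle)
file_x.push(60)        # starts writing immediately, 0 rows in flight
file_x.push(60)        # queued, 60 rows in flight
file_x.push(60)        # queued, 120 rows in flight, source is paused
file_x.finish_write()  # 60 rows in flight again, source is resumed
```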
   
   All of this is actually working correctly.  The problem is that, when it 
pauses the source, a few extra tasks leak in because they were already running. 
 This is kind of ok, but then it unpauses, fills up, and pauses again, and a 
few more extra tasks leak in.  This process repeats...a lot.  By the time it 
crashed on my machine there were over a thousand extra tasks.  Because all these 
tasks are getting in, the source thinks the data is being consumed and it keeps 
reading.  This becomes uncontrolled memory growth and everything crashes.
   
   I suspect this is related to partitioning because you end up with lots of 
tiny writes and the in flight throttle fills and empties A LOT during 
execution.  You might be able to get things to pass if you set 
`min_rows_per_group` to some largish value although then you run into a 
different kind of throttle (max rows "staged") and so it might not help.
   
   The proper fix would be to not release the throttle until the extra tasks 
that snuck in the last time it was paused have been launched.  I am doing some 
arrow-cpp work this week and will try to take a look at this.
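
   A very rough simulation of why holding the throttle should help.  This is a 
toy model, not the Arrow implementation: the function `simulate`, its 
parameters, and the exact ordering of events inside a step are assumptions made 
for illustration.  It assumes the source is much faster than the writer, that 
one write finishes per step, and that a fixed number of already-running tasks 
("stragglers") arrive every time the source is paused.

```python
def simulate(steps, capacity, stragglers, wait_for_backlog):
    """Return the largest backlog of extra tasks seen during the run.

    wait_for_backlog=False models the current behavior (resume the source as
    soon as a slot frees up); True models the proposed fix (resume only once
    the extra tasks from the previous pause have been launched).
    """
    in_flight = 0   # tasks currently admitted by the throttle
    backlog = 0     # extra tasks that arrived while the source was paused
    paused = False
    peak_backlog = 0

    for _ in range(steps):
        # The writer finishes one task, freeing one throttle slot.
        if in_flight > 0:
            in_flight -= 1

        # Decide whether the source may resume.
        if paused and in_flight < capacity:
            if not wait_for_backlog or backlog == 0:
                paused = False

        # Extra tasks that were waiting take whatever slots are free.
        while backlog > 0 and in_flight < capacity:
            backlog -= 1
            in_flight += 1

        # A running source immediately hits the limit and is paused again,
        # but the tasks it already had running still arrive.
        if not paused:
            in_flight = capacity
            paused = True
            backlog += stragglers

        peak_backlog = max(peak_backlog, backlog)

    return peak_backlog


current = simulate(steps=1000, capacity=4, stragglers=2, wait_for_backlog=False)
proposed = simulate(steps=1000, capacity=4, stragglers=2, wait_for_backlog=True)
print(current, proposed)
```

   In this model the backlog grows by `stragglers - 1` on every step when the 
source is resumed eagerly, and never exceeds `stragglers` when the resume waits 
for the backlog to drain.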


