kevinwilfong opened a new pull request, #10661:
URL: https://github.com/apache/incubator-gluten/pull/10661

   ## What changes are proposed in this pull request?
   
   Velox supports a config preferred_output_batch_bytes which is a soft/best 
effort limit on the number of bytes consumed in
   a batch output by an Operator. It is used alongside max_output_batch_rows 
and in place of preferred_output_batch_rows, and applies when the number of 
bytes in rows can be estimated and that estimate exceeds 
preferred_output_batch_bytes before max_output_batch_rows is reached. This can 
help the system to adapt to very wide rows without OOM'ing.
   
   In addition to plumbing this config through Spark to Velox, I also modified 
RowToVeloxColumnarExec to respect preferred_output_batch_bytes. Here I limit 
the Arrow buffer it allocates to hold preferred_output_batch_bytes + whatever 
overflow is needed to hold the last row that causes us to exceed the limit 
(again it's a soft limit).
   
   We are currently using the default in Velox so users shouldn't see any 
change in behavior there unless they choose to set the config. Users may see 
smaller batches coming from RowToVeloxColumnarExec if the rows are currently 
exceeding the limit as we were previously not imposing the limit there.
   
   ## How was this patch tested?
   
   Verified locally, we were seeing OOMs coming from RowToVeloxColumnarExec due 
to very wide rows.
   
   Added a unit test to check that preferred_output_batch_bytes is respected in 
RowToVeloxColumnarExec.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to