kevinwilfong opened a new pull request, #10661: URL: https://github.com/apache/incubator-gluten/pull/10661
## What changes are proposed in this pull request?

Velox supports a config, preferred_output_batch_bytes, which is a soft (best-effort) limit on the number of bytes in a batch output by an Operator. It is used alongside max_output_batch_rows and in place of preferred_output_batch_rows. It applies when the size of the rows in bytes can be estimated and that estimate exceeds preferred_output_batch_bytes before max_output_batch_rows is reached. This can help the system adapt to very wide rows without OOM'ing.

In addition to plumbing this config through Spark to Velox, I also modified RowToVeloxColumnarExec to respect preferred_output_batch_bytes. The Arrow buffer it allocates is now limited to preferred_output_batch_bytes plus whatever overflow is needed to hold the last row that causes the batch to exceed the limit (again, it's a soft limit).

We are currently using the default in Velox, so users shouldn't see any change in behavior there unless they choose to set the config. Users may see smaller batches coming from RowToVeloxColumnarExec if their rows currently exceed the limit, since the limit was previously not imposed there.

## How was this patch tested?

Verified locally; we had been seeing OOMs coming from RowToVeloxColumnarExec due to very wide rows. Added a unit test to check that preferred_output_batch_bytes is respected in RowToVeloxColumnarExec.
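
To illustrate the soft-limit behavior described under "What changes are proposed", here is a minimal Python sketch of how a batch might be closed by either a byte limit or a row limit. It is not the Gluten or Velox implementation: the function name `split_into_batches`, its parameters, and the choice to close a batch once the running total reaches or exceeds the byte limit are assumptions made for illustration only.

```python
from typing import Iterable, Iterator, List


def split_into_batches(
    row_sizes: Iterable[int],
    preferred_batch_bytes: int,
    max_batch_rows: int,
) -> Iterator[List[int]]:
    """Group per-row byte sizes into batches (hypothetical helper).

    The byte limit is soft: the row that pushes the batch to or past
    preferred_batch_bytes is still included, so a batch can overflow
    the limit by at most one row. The row limit is a hard cap.
    """
    batch: List[int] = []
    batch_bytes = 0
    for size in row_sizes:
        batch.append(size)
        batch_bytes += size
        # Close the batch when either limit is hit; the crossing row stays in.
        if batch_bytes >= preferred_batch_bytes or len(batch) >= max_batch_rows:
            yield batch
            batch, batch_bytes = [], 0
    if batch:
        # Emit whatever is left after the input is exhausted.
        yield batch


# Two wide rows close the first batch by bytes; narrow rows close by row count.
batches = list(split_into_batches([60, 60, 5, 5, 5, 5], 100, 3))
print(batches)
```

With wide rows the byte limit closes a batch well before the row limit is reached, which is the case the PR targets; with narrow rows the row limit still governs.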
