Satyr09 commented on PR #22649: URL: https://github.com/apache/datafusion/pull/22649#issuecomment-4621624724
Hi @2010YOUY01 , thank you so much for your reviews, and the approval! Heads-up before merge: while adding the end-to-end tests, I found that `max_row_group_bytes` is currently honored only by the single-threaded Parquet writer. The reason, as far as my understanding goes is that the serial path hands the whole write loop to arrow-rs's `ArrowWriter` (boundary logic and all), so the byte-flush is free. The parallel path reimplements the row-group boundary loop itself, which only ever consulted `max_row_group_row_count` - so `max_row_group_bytes` was never checked. This PR's e2e tests reflect that - we do ``allow_single_file_parallelism = false` in order to run e2e testing honouring the new config. I am working on a follow up that teaches the parallel writer in DF to honour this config as well - but I suspect that would be a larger change and might be better served by being its own PR. Given this, would you still be okay with merging this as is (with a note mentioning that this config currently only works when allow_single_file_parallelism is also switched off)? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
