Re: [PR] feat: add max_row_group_bytes option to ParquetOptions [datafusion]

via GitHub Thu, 04 Jun 2026 04:15:21 -0700


Satyr09 commented on PR #22649:
URL: https://github.com/apache/datafusion/pull/22649#issuecomment-4621624724


   Hi @2010YOUY01 , thank you so much for your reviews, and the approval!
   
   Heads-up before merge: while adding the end-to-end tests, I found that 
`max_row_group_bytes` is currently honored only by the single-threaded Parquet 
writer.
   
   The reason, as far as my understanding goes is that the serial path hands 
the whole write loop to arrow-rs's `ArrowWriter` (boundary logic and all), so 
the byte-flush is free. The parallel path reimplements the row-group boundary 
loop itself, which only ever consulted `max_row_group_row_count` - so 
`max_row_group_bytes` was never checked.
   
   This PR's e2e tests reflect that  - we do ``allow_single_file_parallelism = 
false` in order to run e2e testing honouring the new config.
   
   I am working on a follow up that teaches the parallel writer in DF to honour 
this config as well - but I suspect that would be a larger change and might be 
better served by being its own PR.
   
   Given this, would you still be okay with merging this as is (with a note 
mentioning that this config currently only works when 
allow_single_file_parallelism is also switched off)?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: add max_row_group_bytes option to ParquetOptions [datafusion]

Reply via email to