westonpace commented on pull request #11556: URL: https://github.com/apache/arrow/pull/11556#issuecomment-953269749
Yes, the Parquet writer has a configurable max row group size, but it does not have a configurable min row group size. The latter is helpful in particular for dataset writing, because each incoming batch is split into N smaller partition batches. If we then turn around and write those batches immediately, we can often end up with a bunch of small row groups, which is undesirable.

Also, the behavior of the max row group size is not quite what I'd want. For example, if the max row group size is 1 million rows and I send a series of batches with 1.1 million rows each, I'll end up with alternating row groups of 1 million rows and 100k rows.

We could push all of these features down into the writers themselves, I suppose. That might be better from a separation-of-concerns point of view, although it would make it a little harder to enforce `max_rows_staged` unless we also added a "force write" operation to the writers.
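To make the staging idea concrete, here is a minimal sketch of the buffering logic described above. All names (`RowGroupStager`, `write_batch`, `close`) are hypothetical, and row groups are modeled as plain row counts rather than real Arrow batches; the point is only to show how staging rows until a minimum is reached avoids the small trailing row groups, with `close` acting as the "force write" operation.

```python
# Hypothetical sketch: stage incoming batches so that emitted row groups
# respect both a minimum and a maximum row count. Row groups are modeled
# as integer row counts for simplicity.

class RowGroupStager:
    def __init__(self, min_rows, max_rows):
        assert min_rows <= max_rows
        self.min_rows = min_rows
        self.max_rows = max_rows
        self.staged = 0          # rows currently buffered, not yet written
        self.row_groups = []     # sizes of row groups "written" so far

    def write_batch(self, num_rows):
        self.staged += num_rows
        # Emit full max-size row groups; any remainder below min_rows is
        # carried over and combined with the next batch instead of being
        # written as a tiny row group.
        while self.staged >= self.max_rows:
            self.row_groups.append(self.max_rows)
            self.staged -= self.max_rows

    def close(self):
        # "Force write": on close, whatever is staged must be flushed,
        # even if it is smaller than min_rows.
        if self.staged:
            self.row_groups.append(self.staged)
            self.staged = 0


# Three incoming batches of 1.1M rows each, max row group size of 1M:
stager = RowGroupStager(min_rows=500_000, max_rows=1_000_000)
for _ in range(3):
    stager.write_batch(1_100_000)
stager.close()
print(stager.row_groups)  # [1000000, 1000000, 1000000, 300000]
```

Writing each batch immediately would instead produce `[1000000, 100000, 1000000, 100000, 1000000, 100000]`, i.e. the alternating 1M/100k pattern described above; staging merges the 100k remainders into subsequent row groups.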
