wecharyu commented on PR #48468: URL: https://github.com/apache/arrow/pull/48468#issuecomment-3998954732
> I still think we should not try to estimate anything here.

@pitrou It seems the first row group has to depend on estimated data; otherwise `max_row_group_bytes` could not take effect on it. Many other implementations, such as `parquet-java` and `arrow-rs`, use both the compressed page data and the encoded buffered bytes to estimate the remaining rows of a row group. Given that the uncompressed buffered bytes are typically a small portion of the total footprint, would it be reasonable to rely on a similar estimation approach here as well?

CC: @wgtmac
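
To make the proposal concrete, here is a minimal sketch of the kind of estimate described above. It is not the code from this PR, `parquet-java`, or `arrow-rs`. The function name and parameters are hypothetical, and it assumes the average bytes per row observed so far (compressed pages plus encoded buffered bytes) also holds for the rest of the row group.

```python
from typing import Optional


def estimate_remaining_rows(
    max_row_group_bytes: int,
    rows_written: int,
    compressed_page_bytes: int,
    buffered_encoded_bytes: int,
) -> Optional[int]:
    """Estimate how many more rows fit in the current row group.

    Hypothetical helper: all names are illustrative assumptions.
    Returns None when there is nothing to base an estimate on yet.
    """
    # Footprint so far: pages already compressed plus data that is
    # encoded but still buffered (not yet compressed into a page).
    used_bytes = compressed_page_bytes + buffered_encoded_bytes

    if rows_written == 0 or used_bytes == 0:
        return None  # no observations, so no estimate

    if used_bytes >= max_row_group_bytes:
        return 0  # budget already exhausted

    # remaining_bytes / (used_bytes / rows_written), in integer arithmetic
    remaining_bytes = max_row_group_bytes - used_bytes
    return remaining_bytes * rows_written // used_bytes


if __name__ == "__main__":
    # 1000 rows occupy 200,000 bytes so far, i.e. 200 bytes per row on average.
    print(estimate_remaining_rows(1_000_000, 1000, 180_000, 20_000))
```

With these example numbers, 800,000 bytes remain at roughly 200 bytes per row, so the estimate is 4000 more rows. The buffered bytes are counted at their encoded size, which overstates their final compressed size; that is the approximation the question above is about.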
