wecharyu commented on PR #48468:
URL: https://github.com/apache/arrow/pull/48468#issuecomment-4012592452

   > Well, the already compressed page data is sufficient to get a lower bound 
estimate.
   
   The input batch size is uncertain. If we used the already written compressed 
page data to decide whether to flush a new row group, we would need to probe for 
an appropriate batch size on each write. Otherwise, writing the entire batch at 
once could cause the row group size to exceed `max_row_group_bytes` by a large 
margin.
   
   That would make things more complicated. Conversely, estimating the remaining 
number of rows from the total values appears to be a simpler approach. This is 
similar to how arrow-rs uses `get_estimated_total_bytes` to split batches:
   
https://github.com/apache/arrow-rs/blob/5ba451531efd2e98de38f6a8443aad605b6b5cc5/parquet/src/arrow/arrow_writer/mod.rs#L354-L380
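   To illustrate the idea, here is a minimal sketch (hypothetical names, not the 
actual arrow-rs or arrow C++ API) of estimating how many more rows fit in the 
current row group from the bytes-per-row observed so far:

   ```rust
   /// Estimate how many more rows fit before `max_row_group_bytes` is reached,
   /// assuming future rows are roughly as large as the rows written so far.
   fn estimate_remaining_rows(
       written_bytes: u64,
       written_rows: u64,
       max_row_group_bytes: u64,
   ) -> u64 {
       if written_rows == 0 || written_bytes == 0 {
           // No data yet: no estimate is possible, accept the whole batch.
           return u64::MAX;
       }
       let bytes_per_row = written_bytes as f64 / written_rows as f64;
       let remaining_bytes = max_row_group_bytes.saturating_sub(written_bytes);
       (remaining_bytes as f64 / bytes_per_row) as u64
   }

   fn main() {
       // 1000 rows took 100_000 bytes (~100 bytes/row); a 150_000-byte budget
       // leaves 50_000 bytes, i.e. room for about 500 more rows.
       assert_eq!(estimate_remaining_rows(100_000, 1000, 150_000), 500);
   }
   ```

   The batch can then be sliced at the estimated row count, so a single oversized 
batch never blows past the row group limit by more than one row's worth of data.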


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
