wecharyu commented on code in PR #48468:
URL: https://github.com/apache/arrow/pull/48468#discussion_r2741038067


##########
cpp/src/parquet/file_writer.cc:
##########
@@ -68,6 +68,12 @@ int64_t RowGroupWriter::total_compressed_bytes_written() const {
   return contents_->total_compressed_bytes_written();
 }
 
+int64_t RowGroupWriter::EstimatedTotalCompressedBytes() const {
+  return contents_->total_compressed_bytes() +
+         contents_->total_compressed_bytes_written() +
+         contents_->EstimatedBufferedValueBytes();
+}

Review Comment:
   @pitrou If we ignore the buffered values, the under-estimation can be huge when many columns have not yet finished their first page. Conversely, once there are many flushed pages, the over-estimation from including the buffered values is not significant.
   
   To illustrate, assume:
   - a row group with 1000 flushed pages, each page holding 1000 rows;
   - compressed row size: 10 bytes; buffered row size: 100 bytes (10x larger).
   
   Then the average row size is (1000 * 1000 * 10 + 1000 * 100) / (1000 * 1000 + 1000) ≈ 10.09 bytes, which is close to the compressed row size.
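   For reference, a minimal standalone sketch of the arithmetic above (the numbers are the hypothetical ones from this example, not values measured from the PR):
   
   ```cpp
   #include <cstdio>
   
   int main() {
     // Hypothetical scenario: 1000 flushed pages of 1000 rows each, at
     // 10 bytes per compressed row, plus one page's worth of buffered
     // rows (1000 rows) at 100 bytes per row.
     const double pages = 1000.0;
     const double rows_per_page = 1000.0;
     const double compressed_row_bytes = 10.0;
     const double buffered_rows = 1000.0;
     const double buffered_row_bytes = 100.0;
   
     const double total_bytes = pages * rows_per_page * compressed_row_bytes +
                                buffered_rows * buffered_row_bytes;
     const double total_rows = pages * rows_per_page + buffered_rows;
   
     // Prints ~10.09: the buffered rows barely move the average row size,
     // so the over-estimation is negligible once many pages are flushed.
     std::printf("avg bytes/row = %.2f\n", total_bytes / total_rows);
     return 0;
   }
   ```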


