wgtmac commented on code in PR #48468:
URL: https://github.com/apache/arrow/pull/48468#discussion_r2719867964


##########
cpp/src/parquet/file_writer.cc:
##########
@@ -68,6 +68,12 @@ int64_t RowGroupWriter::total_compressed_bytes_written() const {
   return contents_->total_compressed_bytes_written();
 }
 
+int64_t RowGroupWriter::EstimatedTotalCompressedBytes() const {
+  return contents_->total_compressed_bytes() +
+         contents_->total_compressed_bytes_written() +
+         contents_->EstimatedBufferedValueBytes();

Review Comment:
   I think we have three options for the buffered (not yet compressed) values:
   
   1. Ignore buffered values entirely. This under-estimates the row group size.
   2. Apply an empirical compression ratio, which is imprecise and hard to choose.
   3. Use the encoded but uncompressed size directly. This over-estimates the row group size.
   
   I still prefer option 3: include the buffered values, but skip the complexity of tuning a ratio. Each column can have only one buffered page, and its size should be small compared to the pages that are already compressed.
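   
   A minimal sketch of how the three options differ, assuming a simplified snapshot of the byte counters. The `RowGroupBytes` struct, its fields, and the three `Estimate*` functions are hypothetical names for illustration and are not part of the Parquet writer API; the fields only loosely stand in for the three terms summed in the diff above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of a row group's byte counters (not the Parquet API).
struct RowGroupBytes {
  int64_t compressed_written;   // compressed pages already written to the sink
  int64_t compressed_buffered;  // compressed pages still held in memory
  int64_t encoded_values;       // encoded values not yet compressed into a page
};

// Option 1: ignore values that are not yet compressed (under-estimates).
int64_t EstimateIgnoringBuffered(const RowGroupBytes& b) {
  return b.compressed_written + b.compressed_buffered;
}

// Option 2: scale the encoded size by an assumed compression ratio.
// The ratio is a guess supplied by the caller, which is why it is imprecise.
int64_t EstimateWithRatio(const RowGroupBytes& b, double ratio) {
  assert(ratio > 0.0);
  return EstimateIgnoringBuffered(b) +
         static_cast<int64_t>(static_cast<double>(b.encoded_values) * ratio);
}

// Option 3: add the encoded size as-is (over-estimates when data compresses).
int64_t EstimateWithEncodedSize(const RowGroupBytes& b) {
  return EstimateIgnoringBuffered(b) + b.encoded_values;
}
```

   With at most one buffered page per column, `encoded_values` stays small relative to the compressed terms, so the gap between options 1 and 3 is bounded by roughly one page per column.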



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
