pitrou commented on code in PR #48468:
URL: https://github.com/apache/arrow/pull/48468#discussion_r2727949708


##########
cpp/src/parquet/file_writer.cc:
##########
@@ -68,6 +68,12 @@ int64_t RowGroupWriter::total_compressed_bytes_written() const {
   return contents_->total_compressed_bytes_written();
 }
 
+int64_t RowGroupWriter::EstimatedTotalCompressedBytes() const {
+  return contents_->total_compressed_bytes() +
+         contents_->total_compressed_bytes_written() +
+         contents_->EstimatedBufferedValueBytes();

Review Comment:
   > 1. Totally ignore buffered values. This under-estimates the row group size.
   > 2. Use an empirical compression ratio, which is imprecise and difficult to decide.
   > 3. Directly use the uncompressed but encoded size to estimate. This over-estimates the row group size.
   
   The under-estimation in option 1 becomes small as soon as there are many pages, while the over-estimation in option 3 may be huge when compression is excellent.
   
   Also, option 1 is simpler to implement.
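
   To put rough numbers on options 1 and 3, here is a minimal sketch. The `RowGroupSizes` struct, its field names, and both helper functions are hypothetical; they only loosely mirror the counters summed in the diff above and are not the Parquet C++ API.

```cpp
#include <cstdint>

// Hypothetical snapshot of a row group writer's byte counters.
// These names are illustrative only, not part of the Parquet C++ API.
struct RowGroupSizes {
  int64_t compressed_written;         // compressed pages already written to the sink
  int64_t compressed_buffered_pages;  // compressed pages still held in memory
  int64_t encoded_buffered_values;    // encoded values not yet compressed into a page
};

// Option 1: ignore buffered values entirely (under-estimates).
int64_t EstimateIgnoringBufferedValues(const RowGroupSizes& s) {
  return s.compressed_written + s.compressed_buffered_pages;
}

// Option 3: add the encoded, uncompressed size of buffered values (over-estimates).
int64_t EstimateWithEncodedSize(const RowGroupSizes& s) {
  return EstimateIgnoringBufferedValues(s) + s.encoded_buffered_values;
}
```

   Take 100 written pages of 10 KiB compressed each, plus 100 KiB of encoded values still buffered. Option 1 reports 1000 KiB and option 3 reports 1100 KiB. If the buffered values compress 10x like the earlier pages (an assumed ratio), the real size is about 1010 KiB, so option 1 is low by roughly 1% and option 3 is high by roughly 9%. With only one written page, option 1 reports 10 KiB and option 3 reports 110 KiB against a real size of about 20 KiB.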



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
