shangxinli commented on code in PR #14435:
URL: https://github.com/apache/iceberg/pull/14435#discussion_r2777674119


##########
api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java:
##########
@@ -147,6 +147,32 @@ public interface RewriteDataFiles
    */
   String OUTPUT_SPEC_ID = "output-spec-id";
 
+  /**
+   * Use Parquet row-group level merging during rewrite operations when applicable.
+   *
+   * <p>When enabled, Parquet files will be merged at the row-group level by directly copying row

Review Comment:
   @lintingbin the compression is at the page level. If your streaming
checkpoint interval produces pages near the typical page size (default 1 MB),
we can consider a later PR to merge at the page level. In
[Parquet](https://github.com/apache/parquet-java), we have made changes to
rewrite Parquet files without decompressing and then recompressing. For
example, we do encryption rewrites (also at the page level) that way: we walk
through each page and, without decoding or decompressing it, immediately
encrypt the page and write it to disk. This is several times faster than
record-by-record rewriting. But that is a more complex change; we can do it
later.
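
The row-group-level copy that this option enables can be sketched with
parquet-java's existing `ParquetFileWriter.appendFile`, which copies each
input row group byte-for-byte so pages are never decoded, decompressed, or
recompressed. This is an illustrative sketch, not the actual PR
implementation; the method name, paths, and sizes below are assumptions:

```java
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.hadoop.util.HadoopOutputFile;
import org.apache.parquet.schema.MessageType;

public class RowGroupMergeSketch {

  /**
   * Merges the given Parquet files into one output file by copying row groups
   * as raw bytes. Compression and encoding are preserved exactly because no
   * page is ever decoded. All inputs must share the same schema.
   */
  static void mergeAtRowGroupLevel(
      Configuration conf, MessageType schema, List<Path> inputs, Path output)
      throws Exception {
    ParquetFileWriter writer =
        new ParquetFileWriter(
            HadoopOutputFile.fromPath(output, conf),
            schema,
            ParquetFileWriter.Mode.CREATE,
            128 * 1024 * 1024, // row-group size hint; raw copies keep input sizes
            0);                // max padding
    writer.start();
    for (Path input : inputs) {
      // appendFile copies every row group of the input file byte-for-byte.
      writer.appendFile(HadoopInputFile.fromPath(input, conf));
    }
    writer.end(Collections.emptyMap()); // write the merged footer
  }
}
```

A future page-level variant, as described above, would walk the compressed
page bytes of each row group and stream them (re-encrypted or re-stitched)
straight to the output, which is the approach parquet-java's rewriter already
uses for encryption.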



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

