aokolnychyi commented on code in PR #7389:
URL: https://github.com/apache/iceberg/pull/7389#discussion_r1179625482
##########
api/src/main/java/org/apache/iceberg/actions/RewritePositionDeleteFiles.java:
##########
@@ -28,6 +32,80 @@
public interface RewritePositionDeleteFiles
extends SnapshotUpdate<RewritePositionDeleteFiles,
RewritePositionDeleteFiles.Result> {
+ /**
+ * Enable committing groups of files (see max-file-group-size-bytes) prior
to the entire rewrite
+ * completing. This will produce additional commits but allow for progress
even if some groups
+ * fail to commit. This setting will not change the correctness of the
rewrite operation as file
+ * groups can be compacted independently.
+ *
+ * <p>The default is false, which produces a single commit when the entire
job has completed.
+ */
+ String PARTIAL_PROGRESS_ENABLED = "partial-progress.enabled";
+
+ boolean PARTIAL_PROGRESS_ENABLED_DEFAULT = false;
+
+ /**
+ * The maximum amount of Iceberg commits that this rewrite is allowed to
produce if partial
+ * progress is enabled. This setting has no effect if partial progress is
disabled.
+ */
+ String PARTIAL_PROGRESS_MAX_COMMITS = "partial-progress.max-commits";
+
+ int PARTIAL_PROGRESS_MAX_COMMITS_DEFAULT = 10;
+
+ /**
+ * The entire rewrite operation is broken down into pieces based on
partitioning and within
+ * partitions based on size into groups. These sub-units of the rewrite are
referred to as file
+ * groups. The largest amount of data that should be compacted in a single
group is controlled by
+ * {@link #MAX_FILE_GROUP_SIZE_BYTES}. This helps with breaking down the
rewriting of very large
+ * partitions which may not be rewritable otherwise due to the resource
constraints of the
+ * cluster. For example a sort based rewrite may not scale to terabyte sized
partitions, those
+ * partitions need to be worked on in small subsections to avoid exhaustion
of resources.
+ *
+ * <p>When grouping files, the underlying rewrite strategy will use this
value as to limit the
+ * files which will be included in a single file group. A group will be
processed by a single
+ * framework "action". For example, in Spark this means that each group
would be rewritten in its
+ * own Spark action. A group will never contain files for multiple output
partitions.
+ */
+ String MAX_FILE_GROUP_SIZE_BYTES = "max-file-group-size-bytes";
+
+ long MAX_FILE_GROUP_SIZE_BYTES_DEFAULT = 1024L * 1024L * 1024L * 100L; //
100 Gigabytes
+
+ /**
+ * The max number of file groups to be simultaneously rewritten by the
rewrite strategy. The
+ * structure and contents of the group is determined by the rewrite
strategy. Each file group will
+ * be rewritten independently and asynchronously.
+ */
+ String MAX_CONCURRENT_FILE_GROUP_REWRITES =
"max-concurrent-file-group-rewrites";
+
+ int MAX_CONCURRENT_FILE_GROUP_REWRITES_DEFAULT = 5;
+
+ /**
+ * The output file size that this rewrite strategy will attempt to generate
when rewriting files.
+ * By default this will use the "write.target-file-size-bytes value" in the
table properties of
Review Comment:
I believe this is also part of the rewriter API now.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]