[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7389: Spark 3.4: Implement rewrite position deletes

via GitHub Thu, 27 Apr 2023 13:01:00 -0700


aokolnychyi commented on code in PR #7389:
URL: https://github.com/apache/iceberg/pull/7389#discussion_r1179623267



##########
api/src/main/java/org/apache/iceberg/actions/RewritePositionDeleteFiles.java:
##########
@@ -28,6 +32,80 @@
 public interface RewritePositionDeleteFiles
     extends SnapshotUpdate<RewritePositionDeleteFiles, 
RewritePositionDeleteFiles.Result> {
 
+  /**
+   * Enable committing groups of files (see max-file-group-size-bytes) prior 
to the entire rewrite
+   * completing. This will produce additional commits but allow for progress 
even if some groups
+   * fail to commit. This setting will not change the correctness of the 
rewrite operation as file
+   * groups can be compacted independently.
+   *
+   * <p>The default is false, which produces a single commit when the entire 
job has completed.
+   */
+  String PARTIAL_PROGRESS_ENABLED = "partial-progress.enabled";
+
+  boolean PARTIAL_PROGRESS_ENABLED_DEFAULT = false;
+
+  /**
+   * The maximum amount of Iceberg commits that this rewrite is allowed to 
produce if partial
+   * progress is enabled. This setting has no effect if partial progress is 
disabled.
+   */
+  String PARTIAL_PROGRESS_MAX_COMMITS = "partial-progress.max-commits";
+
+  int PARTIAL_PROGRESS_MAX_COMMITS_DEFAULT = 10;
+
+  /**
+   * The entire rewrite operation is broken down into pieces based on 
partitioning and within
+   * partitions based on size into groups. These sub-units of the rewrite are 
referred to as file
+   * groups. The largest amount of data that should be compacted in a single 
group is controlled by
+   * {@link #MAX_FILE_GROUP_SIZE_BYTES}. This helps with breaking down the 
rewriting of very large
+   * partitions which may not be rewritable otherwise due to the resource 
constraints of the
+   * cluster. For example a sort based rewrite may not scale to terabyte sized 
partitions, those
+   * partitions need to be worked on in small subsections to avoid exhaustion 
of resources.
+   *
+   * <p>When grouping files, the underlying rewrite strategy will use this 
value as to limit the
+   * files which will be included in a single file group. A group will be 
processed by a single
+   * framework "action". For example, in Spark this means that each group 
would be rewritten in its
+   * own Spark action. A group will never contain files for multiple output 
partitions.
+   */
+  String MAX_FILE_GROUP_SIZE_BYTES = "max-file-group-size-bytes";
+
+  long MAX_FILE_GROUP_SIZE_BYTES_DEFAULT = 1024L * 1024L * 1024L * 100L; // 
100 Gigabytes

Review Comment:
   If we decide to keep this config and default value here, let's switch the 
order of constants, it is easier to read.
   
   ```
   100 * 1024L * 1024L * 1024L
   ```



##########
api/src/main/java/org/apache/iceberg/actions/RewritePositionDeleteFiles.java:
##########
@@ -28,6 +32,80 @@
 public interface RewritePositionDeleteFiles
     extends SnapshotUpdate<RewritePositionDeleteFiles, 
RewritePositionDeleteFiles.Result> {
 
+  /**
+   * Enable committing groups of files (see max-file-group-size-bytes) prior 
to the entire rewrite
+   * completing. This will produce additional commits but allow for progress 
even if some groups
+   * fail to commit. This setting will not change the correctness of the 
rewrite operation as file
+   * groups can be compacted independently.
+   *
+   * <p>The default is false, which produces a single commit when the entire 
job has completed.
+   */
+  String PARTIAL_PROGRESS_ENABLED = "partial-progress.enabled";
+
+  boolean PARTIAL_PROGRESS_ENABLED_DEFAULT = false;
+
+  /**
+   * The maximum amount of Iceberg commits that this rewrite is allowed to 
produce if partial
+   * progress is enabled. This setting has no effect if partial progress is 
disabled.
+   */
+  String PARTIAL_PROGRESS_MAX_COMMITS = "partial-progress.max-commits";
+
+  int PARTIAL_PROGRESS_MAX_COMMITS_DEFAULT = 10;
+
+  /**
+   * The entire rewrite operation is broken down into pieces based on 
partitioning and within
+   * partitions based on size into groups. These sub-units of the rewrite are 
referred to as file
+   * groups. The largest amount of data that should be compacted in a single 
group is controlled by
+   * {@link #MAX_FILE_GROUP_SIZE_BYTES}. This helps with breaking down the 
rewriting of very large
+   * partitions which may not be rewritable otherwise due to the resource 
constraints of the
+   * cluster. For example a sort based rewrite may not scale to terabyte sized 
partitions, those
+   * partitions need to be worked on in small subsections to avoid exhaustion 
of resources.
+   *
+   * <p>When grouping files, the underlying rewrite strategy will use this 
value as to limit the
+   * files which will be included in a single file group. A group will be 
processed by a single
+   * framework "action". For example, in Spark this means that each group 
would be rewritten in its
+   * own Spark action. A group will never contain files for multiple output 
partitions.
+   */
+  String MAX_FILE_GROUP_SIZE_BYTES = "max-file-group-size-bytes";
+
+  long MAX_FILE_GROUP_SIZE_BYTES_DEFAULT = 1024L * 1024L * 1024L * 100L; // 
100 Gigabytes

Review Comment:
   If we decide to keep this config and default value here, let's switch the 
order of constants, it is easier to read.
   
   ```
   100 * 1024L * 1024L * 1024L // 100 GB
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7389: Spark 3.4: Implement rewrite position deletes

Reply via email to