aokolnychyi commented on code in PR #4629:
URL: https://github.com/apache/iceberg/pull/4629#discussion_r876107119
##########
.palantir/revapi.yml:
##########
@@ -27,6 +27,12 @@ acceptedBreaks:
   - code: "java.method.addedToInterface"
     new: "method ThisT org.apache.iceberg.SnapshotUpdate<ThisT>::scanManifestsWith(java.util.concurrent.ExecutorService)"
     justification: "Accept all changes prior to introducing API compatibility checks"
+  - code: "java.method.addedToInterface"
Review Comment:
I am okay with this given the other changes, the upcoming 1.0 release, and the
fact that it is mostly Iceberg itself that provides implementations of this
interface. Even if someone has custom actions, they probably reuse the provided
result implementation. If other folks are concerned, we could default the new
methods instead (see the sketch below).
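For illustration, "defaulting" could look roughly like this; a sketch only, with
the method name taken from the revapi entry in this PR (the PR itself adds the
methods as abstract):

```java
// Sketch: a defaulted accessor keeps pre-existing custom Result
// implementations compiling; not what the PR actually does.
public interface Result {
  long deletedDataFilesCount();

  default long deletedEqualityDeleteFilesCount() {
    // 0 is a safe default for implementations that predate delete file metrics
    return 0L;
  }
}
```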
##########
api/src/main/java/org/apache/iceberg/ManifestContent.java:
##########
@@ -35,4 +35,12 @@ public enum ManifestContent {
   public int id() {
     return id;
   }
+
+  public static ManifestContent fromId(int id) {
+    switch (id) {
Review Comment:
nit: would it be safer to throw an exception if `id` is unknown just in case?
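Something along these lines, assuming the existing `DATA(0)` / `DELETES(1)` ids
(a sketch, not the PR's final code):

```java
public static ManifestContent fromId(int id) {
  switch (id) {
    case 0:
      return DATA;
    case 1:
      return DELETES;
    default:
      // fail fast instead of silently returning null for unknown ids
      throw new IllegalArgumentException("Unknown manifest content id: " + id);
  }
}
```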
##########
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestExpireSnapshotsAction.java:
##########
@@ -122,13 +140,18 @@ private Long rightAfterSnapshot(long snapshotId) {
     return end;
   }

-  private void checkExpirationResults(long expectedDatafiles, long expectedManifestsDeleted,
-                                      long expectedManifestListsDeleted, ExpireSnapshots.Result results) {
+  private void checkExpirationResults(long expectedDatafiles, long expectedPosDeleteFiles, long expectedEqDeleteFiles,
Review Comment:
Thanks for fixing the indentation!
##########
.palantir/revapi.yml:
##########
@@ -27,6 +27,12 @@ acceptedBreaks:
   - code: "java.method.addedToInterface"
     new: "method ThisT org.apache.iceberg.SnapshotUpdate<ThisT>::scanManifestsWith(java.util.concurrent.ExecutorService)"
     justification: "Accept all changes prior to introducing API compatibility checks"
+  - code: "java.method.addedToInterface"
+    new: "method long org.apache.iceberg.actions.ExpireSnapshots.Result::deletedEqualityDeleteFilesCount()"
+    justification: "Adding method to interface"
Review Comment:
nit: Shall we be a little bit more specific?
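For example, something like this wording (just a sketch of a more specific
justification, not a required phrasing):

```yaml
justification: "Adding deleted position/equality delete file counts to ExpireSnapshots.Result"
```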
##########
core/src/main/java/org/apache/iceberg/actions/BaseExpireSnapshotsActionResult.java:
##########
@@ -22,13 +22,29 @@
 public class BaseExpireSnapshotsActionResult implements ExpireSnapshots.Result {

   private final long deletedDataFilesCount;
+  private final long deletedPosDeleteFilesCount;
+  private final long deletedEqDeleteFilesCount;
   private final long deletedManifestsCount;
   private final long deletedManifestListsCount;

   public BaseExpireSnapshotsActionResult(long deletedDataFilesCount,
                                          long deletedManifestsCount,
                                          long deletedManifestListsCount) {
     this.deletedDataFilesCount = deletedDataFilesCount;
+    this.deletedPosDeleteFilesCount = 0;
+    this.deletedEqDeleteFilesCount = 0;
+    this.deletedManifestsCount = deletedManifestsCount;
+    this.deletedManifestListsCount = deletedManifestListsCount;
+  }
+
+  public BaseExpireSnapshotsActionResult(long deletedDataFilesCount,
+                                         long deletedPosDeleteFilesCount,
+                                         long deletedEqDeleteFileCount,
Review Comment:
nit: `File` -> `Files`
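With the rename applied, the second constructor would presumably read as
follows; the hunk is cut off above, so the remaining parameters and assignments
are inferred from the fields shown in the class:

```java
public BaseExpireSnapshotsActionResult(long deletedDataFilesCount,
                                       long deletedPosDeleteFilesCount,
                                       long deletedEqDeleteFilesCount,
                                       long deletedManifestsCount,
                                       long deletedManifestListsCount) {
  this.deletedDataFilesCount = deletedDataFilesCount;
  this.deletedPosDeleteFilesCount = deletedPosDeleteFilesCount;
  this.deletedEqDeleteFilesCount = deletedEqDeleteFilesCount;
  this.deletedManifestsCount = deletedManifestsCount;
  this.deletedManifestListsCount = deletedManifestListsCount;
}
```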
##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseSparkAction.java:
##########
@@ -122,18 +128,29 @@ protected Table newStaticTable(TableMetadata metadata, FileIO io) {
     return new BaseTable(ops, metadataFileLocation);
   }

-  // builds a DF of delete and data file locations by reading all manifests
-  protected Dataset<Row> buildValidContentFileDF(Table table) {
+  // builds a DF of delete and data file path and type by reading all manifests
+  protected Dataset<Row> buildValidContentFileWithTypeDF(Table table) {
     JavaSparkContext context = JavaSparkContext.fromSparkContext(spark.sparkContext());
-    Broadcast<FileIO> ioBroadcast = context.broadcast(SparkUtil.serializableFileIO(table));
+    Broadcast<Table> tableBroadcast = context.broadcast(SerializableTable.copyOf(table));

     Dataset<ManifestFileBean> allManifests = loadMetadataTable(table, ALL_MANIFESTS)
-        .selectExpr("path", "length", "partition_spec_id as partitionSpecId", "added_snapshot_id as addedSnapshotId")
+        .selectExpr("content",
Review Comment:
nit: you could probably move `content` to a new line as well.
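i.e. something like this; the columns after `content` and the trailing
`.as(Encoders.bean(...))` are assumed from the old `selectExpr`, since the hunk
is truncated:

```java
Dataset<ManifestFileBean> allManifests = loadMetadataTable(table, ALL_MANIFESTS)
    .selectExpr(
        "content",
        "path",
        "length",
        "partition_spec_id as partitionSpecId",
        "added_snapshot_id as addedSnapshotId")
    .as(Encoders.bean(ManifestFileBean.class));
```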
##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseSparkAction.java:
##########
@@ -122,18 +128,29 @@ protected Table newStaticTable(TableMetadata metadata, FileIO io) {
     return new BaseTable(ops, metadataFileLocation);
   }

-  // builds a DF of delete and data file locations by reading all manifests
-  protected Dataset<Row> buildValidContentFileDF(Table table) {
+  // builds a DF of delete and data file path and type by reading all manifests
+  protected Dataset<Row> buildValidContentFileWithTypeDF(Table table) {
     JavaSparkContext context = JavaSparkContext.fromSparkContext(spark.sparkContext());
-    Broadcast<FileIO> ioBroadcast = context.broadcast(SparkUtil.serializableFileIO(table));
+    Broadcast<Table> tableBroadcast = context.broadcast(SerializableTable.copyOf(table));
Review Comment:
Not directly related to this PR, but should we just use the `sparkContext`
instance variable instead of creating a new wrapper in this method? See the
sketch below.
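Roughly this, assuming the instance variable is a `JavaSparkContext` named
`sparkContext` (a sketch, not tested against this branch):

```java
// reuse the action's existing JavaSparkContext instead of re-wrapping
// spark.sparkContext() inside the method
Broadcast<Table> tableBroadcast = sparkContext.broadcast(SerializableTable.copyOf(table));
```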
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]