rdblue commented on a change in pull request #1344:
URL: https://github.com/apache/iceberg/pull/1344#discussion_r473355211
##########
File path: spark/src/main/java/org/apache/iceberg/actions/ExpireSnapshotsAction.java
##########
@@ -147,49 +149,41 @@ public ExpireSnapshotsAction deleteWith(Consumer<String> newDeleteFunc) {
   @Override
   public ExpireSnapshotsActionResult execute() {
-    Dataset<Row> originalFiles = null;
-    try {
-      // Metadata before Expiration
-      originalFiles = buildValidFileDF().persist();
-      // Action to trigger persist
-      originalFiles.count();
-
-      // Perform Expiration
-      ExpireSnapshots expireSnaps = table.expireSnapshots().cleanExpiredFiles(false);
-      for (final Long id : expireSnapshotIdValues) {
-        expireSnaps = expireSnaps.expireSnapshotId(id);
-      }
-
-      if (expireOlderThanValue != null) {
-        expireSnaps = expireSnaps.expireOlderThan(expireOlderThanValue);
-      }
-
-      if (retainLastValue != null) {
-        expireSnaps = expireSnaps.retainLast(retainLastValue);
-      }
-
-      expireSnaps.commit();
-
-      // Metadata after Expiration
-      Dataset<Row> validFiles = buildValidFileDF();
-      Dataset<Row> filesToDelete = originalFiles.except(validFiles);
-
-      return deleteFiles(filesToDelete.toLocalIterator());
-    } finally {
-      if (originalFiles != null) {
-        originalFiles.unpersist();
-      }
+    // Metadata before Expiration
+    Dataset<Row> originalFiles = buildValidFileDF(ops.current());
+
+    // Perform Expiration
+    ExpireSnapshots expireSnaps = table.expireSnapshots().cleanExpiredFiles(false);
+    for (final Long id : expireSnapshotIdValues) {
+      expireSnaps = expireSnaps.expireSnapshotId(id);
+    }
+
+    if (expireOlderThanValue != null) {
+      expireSnaps = expireSnaps.expireOlderThan(expireOlderThanValue);
+    }
+
+    if (retainLastValue != null) {
+      expireSnaps = expireSnaps.retainLast(retainLastValue);
     }
+
+    expireSnaps.commit();
+
+    // Metadata after Expiration
+    Dataset<Row> validFiles = buildValidFileDF(ops.refresh());
+    Dataset<Row> filesToDelete = originalFiles.except(validFiles);
+
+    return deleteFiles(filesToDelete.toLocalIterator());
   }

   private Dataset<Row> appendTypeString(Dataset<Row> ds, String type) {
     return ds.select(new Column("file_path"), functions.lit(type).as("file_type"));
   }

-  private Dataset<Row> buildValidFileDF() {
-    return appendTypeString(buildValidDataFileDF(spark), DATA_FILE)
-        .union(appendTypeString(buildManifestFileDF(spark), MANIFEST))
-        .union(appendTypeString(buildManifestListDF(spark, table), MANIFEST_LIST));
+  private Dataset<Row> buildValidFileDF(TableMetadata metadata) {
+    StaticTableOperations staticOps = new StaticTableOperations(metadata.metadataFileLocation(), table.io());
Review comment:
Minor: the metadata file location is passed to `buildManifestFileDF` and `buildValidDataFileDF`, but a `StaticTableOperations` is passed into `buildManifestListDF`. I think it would make a more consistent API if the location were also passed to `buildManifestListDF`.

I know that the difference is that this method accepts a `Table` and doesn't use a metadata table. But it would be a bit cleaner to pass the base `Table` and the metadata location, then create the `StaticTableOperations` in that method rather than here.
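To make the suggestion concrete, a minimal sketch of that shape, assuming the surrounding class's `spark` and `table` fields. The signatures of `buildValidDataFileDF` and `buildManifestFileDF` follow the description above, but the body of `buildManifestListDF` and the `manifestListsOf` helper are hypothetical, since that code isn't part of this diff:

```java
// Sketch only: all three builders take the metadata file location, and the
// StaticTableOperations detail moves into buildManifestListDF itself.
private Dataset<Row> buildValidFileDF(TableMetadata metadata) {
  String location = metadata.metadataFileLocation();
  return appendTypeString(buildValidDataFileDF(spark, location), DATA_FILE)
      .union(appendTypeString(buildManifestFileDF(spark, location), MANIFEST))
      .union(appendTypeString(buildManifestListDF(spark, table, location), MANIFEST_LIST));
}

private Dataset<Row> buildManifestListDF(SparkSession spark, Table baseTable, String metadataFileLocation) {
  // Pin a static view of the table to the given metadata file, so manifest
  // lists are read from that exact version rather than the live table.
  StaticTableOperations staticOps = new StaticTableOperations(metadataFileLocation, baseTable.io());
  Table staticTable = new BaseTable(staticOps, metadataFileLocation);
  // Hypothetical helper standing in for however the action actually turns the
  // static table's manifest lists into a Dataset<Row>.
  return manifestListsOf(spark, staticTable);
}
```

That keeps the three builders uniform and makes the static-table construction an implementation detail of the one method that needs it.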