RussellSpitzer commented on a change in pull request #1471:
URL: https://github.com/apache/iceberg/pull/1471#discussion_r490575150
##########
File path:
spark/src/main/java/org/apache/iceberg/actions/RemoveOrphanFilesAction.java
##########
@@ -254,4 +270,54 @@ private static void listDirRecursively(
return files.iterator();
};
}
+
+ protected static List<String> findOrphanFiles(
+ Dataset<Row> validFileDF,
+ Dataset<Row> actualFileDF) {
+ Column nameEqual = filenameUDF.apply(actualFileDF.col(FILE_PATH_ONLY))
+ .equalTo(filenameUDF.apply(validFileDF.col(FILE_PATH_ONLY)));
+
+ Column pathContains = actualFileDF.col(FILE_PATH_ONLY)
+ .contains(validFileDF.col(FILE_PATH_ONLY));
+
+ Column joinCond = nameEqual.and(pathContains);
+ Column decodeFilepath = decodeUDF.apply(actualFileDF.col(FILE_PATH));
+ return actualFileDF.join(validFileDF, joinCond, "leftanti").select(decodeFilepath)
+ .as(Encoders.STRING())
+ .collectAsList();
+ }
+
+ /**
+ * From
+ * <pre>{@code
+ * Dataset<Row<file_path_with_scheme_authority>>
+ * will be transformed to
+ * Dataset<Row<file_path_no_scheme_authority, file_path_with_scheme_authority>>
+ * }</pre>
+ *
+ * This is required to compare the valid and all files to find the orphan files.
+ * Based on the result data set, only path will be compared while comparing valid and all files path.
+ * As in the case of hadoop, s3, there could be different authority names to access same path, which can give us files
+ * which are part of metadata and not orphan.
+ *
+ * @param filePathWithSchemeAndAuthority : complete file path, can include scheme, authority and path.
+ * @return : {@code file_path_no_scheme_authority, file_path}
+ */
+ protected static Dataset<Row> addFilePathOnlyColumn(Dataset<Row> filePathWithSchemeAndAuthority) {
+ String selectExprFormat = "%s.%s as %s";
+ return filePathWithSchemeAndAuthority.withColumn(URI_DETAIL,
+ addFilePathOnlyUDF.apply(
+ filePathWithSchemeAndAuthority.apply(FILE_PATH)
+ )).selectExpr(
+ String.format(selectExprFormat, URI_DETAIL, FILE_PATH_ONLY, FILE_PATH_ONLY), // file path only
+ String.format(selectExprFormat, URI_DETAIL, FILE_PATH, FILE_PATH)); // fully qualified path
+ }
+
+ static StructType fileDetailStructType() {
Review comment:
```java
private static final StructType FILE_DETAIL_STRUCT = new StructType(new StructField[]{
    DataTypes.createStructField(FILE_PATH_ONLY, DataTypes.StringType, false),
    DataTypes.createStructField(FILE_PATH, DataTypes.StringType, false)
});
```
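For context on the hunk above, here is a rough, Spark-free sketch of the comparison that the Javadoc and `findOrphanFiles` describe: drop the scheme and authority, then treat an actual file as an orphan only if no valid file has the same file name and a path contained in the actual path. The class `OrphanPathSketch`, its helpers `pathOnly` and `fileName`, and the sample locations are hypothetical and not part of the PR. It uses `java.net.URI` rather than the PR's UDFs, so it illustrates the idea and is not the action's implementation.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public final class OrphanPathSketch {

  // Hypothetical helper: keep only the path component, dropping scheme and authority.
  static String pathOnly(String location) {
    return URI.create(location).getPath();
  }

  // Hypothetical helper: last segment of a path.
  static String fileName(String path) {
    return path.substring(path.lastIndexOf('/') + 1);
  }

  // Plain-collections analogue of the "leftanti" join: keep actual files with no matching valid file.
  static List<String> findOrphans(List<String> validFiles, List<String> actualFiles) {
    List<String> validPaths = validFiles.stream()
        .map(OrphanPathSketch::pathOnly)
        .collect(Collectors.toList());

    return actualFiles.stream()
        .filter(actual -> {
          String actualPath = pathOnly(actual);
          // Same two conditions as the join: file names equal AND actual path contains valid path.
          return validPaths.stream().noneMatch(valid ->
              fileName(actualPath).equals(fileName(valid)) && actualPath.contains(valid));
        })
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // The same file reached through two different namenode authorities (made-up values).
    List<String> valid = Arrays.asList("hdfs://nn1:8020/warehouse/t/data/a.parquet");
    List<String> actual = Arrays.asList(
        "hdfs://nn2:8020/warehouse/t/data/a.parquet",
        "hdfs://nn2:8020/warehouse/t/data/b.parquet");

    // Only b.parquet is reported; a.parquet matches despite the different authority.
    System.out.println(findOrphans(valid, actual));
  }
}
```

Comparing full locations instead of `pathOnly` values would flag `a.parquet` as an orphan here, which is the false positive the Javadoc warns about.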