aokolnychyi commented on a change in pull request #1471:
URL: https://github.com/apache/iceberg/pull/1471#discussion_r491051153



##########
File path: spark/src/main/java/org/apache/iceberg/actions/RemoveOrphanFilesAction.java
##########
@@ -70,15 +73,34 @@
 public class RemoveOrphanFilesAction extends BaseAction<List<String>> {
 
   private static final Logger LOG = LoggerFactory.getLogger(RemoveOrphanFilesAction.class);
-  private static final UserDefinedFunction filename = functions.udf((String path) -> {
-    int lastIndex = path.lastIndexOf(File.separator);
-    if (lastIndex == -1) {
-      return path;
-    } else {
-      return path.substring(lastIndex + 1);
-    }
+
+  private static final String URI_DETAIL = "URI_DETAIL";
+  private static final String FILE_NAME = "file_name";
+  private static final String FILE_PATH = "file_path";
+  private static final String FILE_PATH_ONLY = "file_path_only";
+  private static final StructType FILE_DETAIL_STRUCT =  new StructType(new StructField[] {
+      DataTypes.createStructField(FILE_NAME, DataTypes.StringType, false),
+      DataTypes.createStructField(FILE_PATH_ONLY, DataTypes.StringType, false),
+      DataTypes.createStructField(FILE_PATH, DataTypes.StringType, false)
+  });
+
+  private static final UserDefinedFunction decodeUDF = functions.udf((String fullyQualifiedPath) -> {
+    return URLDecoder.decode(fullyQualifiedPath, "UTF-8");
   }, DataTypes.StringType);
 
+  /**
+   * Transform a file path to
+   * {@code Dataset<Row<file_name, file_path_no_scheme_authority, file_path_with_scheme_authority>>}
+   */
+  private static final UserDefinedFunction addFileDetailsUDF = functions.udf((String fileLocation) -> {
+    Path fullyQualifiedPath = new Path(fileLocation);
+    String fileName = fullyQualifiedPath.getName();
+    String filePathOnly = fullyQualifiedPath.toUri().getPath();
+    String filePath = fullyQualifiedPath.toUri().toString();

Review comment:
       Don't we already have a `file_path` column that contains the fully
qualified path to be deleted? We pass that column to this UDF. I think we can
use it directly, and it should already be decoded, if I am not mistaken. If so,
we won't need `decodeUDF`.
   
   @manishmalhotrawork, what do you think?
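
   For context on the decoding question, here is a minimal sketch that uses only `java.net.URI`, the type returned by Hadoop's `Path.toUri()`. It shows that `URI.toString()` keeps percent-encoding while `URI.getPath()` returns the decoded path. The class name `UriDecodingSketch`, the helper `toUri`, and the sample location are hypothetical. The sketch does not exercise Hadoop's `Path` or the Spark UDFs from this PR, so treat it as an illustration of the distinction rather than a statement about what the `file_path` column actually holds.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration class; not part of RemoveOrphanFilesAction.
public class UriDecodingSketch {

  // Builds a URI with the multi-argument constructor, which percent-encodes
  // characters that are illegal in a path (for example, a space).
  public static URI toUri(String scheme, String authority, String path) {
    try {
      return new URI(scheme, authority, path, null, null);
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException("Invalid location: " + path, e);
    }
  }

  public static void main(String[] args) {
    // Assumed sample location containing a space in the file name.
    URI uri = toUri("hdfs", "namenode:8020", "/warehouse/tbl/data/part 1.parquet");

    System.out.println(uri.toString());   // encoded: the space appears as %20
    System.out.println(uri.getRawPath()); // encoded path without scheme/authority
    System.out.println(uri.getPath());    // decoded path without scheme/authority
  }
}
```

   If the column compared against is built from the decoded form, a separate `URLDecoder.decode` step would be redundant. Note also that `URLDecoder` targets form encoding and turns `+` into a space, which is a reason to prefer `URI.getPath()` whenever decoding is needed.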



