aokolnychyi commented on code in PR #4652:
URL: https://github.com/apache/iceberg/pull/4652#discussion_r890347161


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseDeleteOrphanFilesSparkAction.java:
##########
@@ -361,4 +403,90 @@ static PathFilter forSpecs(Map<Integer, PartitionSpec> 
specs) {
       return partitionNames.isEmpty() ? HiddenPathFilter.get() : new 
PartitionAwareHiddenPathFilter(partitionNames);
     }
   }
+
+  private static Dataset<Row> sanitizeAndSplitPaths(Dataset<Row> filesDF) {
+    return filesDF.mapPartitions((MapPartitionsFunction<Row, Row>) input ->
+        Iterators.transform(input, row -> {
+          String pathString = row.getString(0);
+          Path path = new Path(pathString);
+          URI uri = path.toUri();
+          List<Object> values = Lists.newArrayList(uri.getScheme(), 
uri.getAuthority(), uri.getPath(), pathString);
+          return 
Row$.MODULE$.apply(scala.collection.JavaConverters.asScalaBuffer(values).toSeq());
+        }), RowEncoder.apply(fileSchema));
+  }
+  static class MapOrphanFilesFunction implements MapPartitionsFunction<Row, 
String> {
+
+    private final List<String> equivalentSchemes;
+    private final List<String> equivalentAuthorities;
+    private final PrefixMisMatchMode mismatchMode;
+    private final StructType scheme;
+
+    MapOrphanFilesFunction(List<String> equivalentSchemes,
+                           List<String> equivalentAuthorities,
+                           PrefixMisMatchMode mismatchMode,
+                           StructType schema) {
+      this.equivalentSchemes = equivalentSchemes;
+      this.equivalentAuthorities = equivalentAuthorities;
+      this.mismatchMode = mismatchMode;
+      this.scheme = schema;
+    }
+
+    @Override
+    public Iterator<String> call(Iterator<Row> value) throws Exception {
+
+      Iterator<String> orphanFilesIterator = Iterators.transform(value, row -> 
{
+        if (isOrphan(row)) {
+          return row.getString(3);
+        } else {
+          return null;
+        }
+      });
+      return Iterators.filter(orphanFilesIterator, StringUtils::isNotBlank);
+    }
+
+    boolean isOrphan(Row row) {
+      String[] fields = scheme.fieldNames();
+
+      // actual file related fields
+      assert (fields[0].equalsIgnoreCase("scheme"));
+      String actualScheme = row.getString(0);
+      assert (fields[1].equalsIgnoreCase("authority"));
+      String actualAuthority = row.getString(1);
+      assert (fields[3].equalsIgnoreCase("file_path"));
+      String actualFullPath = row.getString(3);
+
+      // valid files related fields
+      assert (fields[4].equalsIgnoreCase("scheme"));
+      String validScheme = row.getString(4);
+      assert (fields[5].equalsIgnoreCase("authority"));
+      String validAuthority = row.getString(5);
+      assert (fields[6].equalsIgnoreCase("path"));
+      String validPath = row.getString(6);
+      assert (fields[7].equalsIgnoreCase("file_path"));
+      String validFullPath = row.getString(7);
+      if (validPath == null) {
+        return true;
+      }
+      boolean isEquivalent = isEquivalent(equivalentSchemes, actualScheme, 
validScheme) &&
+          isEquivalent(equivalentAuthorities, actualAuthority, validAuthority);
+      if (!isEquivalent && mismatchMode == PrefixMisMatchMode.ERROR) {

Review Comment:
   > should this throw an exception on the task side? 
   
   Well, that's an open question. It would be nice to collect all conflicting 
pairs and present to the user. We may add that column and check it on the 
driver after collecting the result. I think that would be better than task 
failures.
   
   > If we have two matches for a file, using the left outer join will produce 
two joined rows and one could fail the equivalence check while the other 
succeeds.
   
   Well, that's similar to the use case I mention 
[here](https://github.com/apache/iceberg/pull/4652/files#r866375503). The 
problem is that we don't know whether `b1` and `b2` are indeed different. If we 
return back all conflicting pairs of authorities and schemes, we can assume if 
the user provided maps of equal authorities and schemes, those are complete and 
if there is no entry to indicate `b1` and `b2` are identical, we assume they 
are different.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to