flyrain commented on code in PR #3415:
URL: https://github.com/apache/polaris/pull/3415#discussion_r2689031469


##########
runtime/service/src/main/java/org/apache/polaris/service/task/TableCleanupTaskHandler.java:
##########
@@ -160,13 +213,14 @@ private Stream<TaskEntity> getManifestTaskStream(
     // read the manifest list for each snapshot. dedupe the manifest files and schedule a
     // cleanupTask
     // for each manifest file and its data files to be deleted
+    // Use a Set to track seen paths for deduplication without materializing all ManifestFile
+    // objects
+    Set<String> seenPaths = new HashSet<>();
     return tableMetadata.snapshots().stream()
         .flatMap(sn -> sn.allManifests(fileIO).stream())
-        // distinct by manifest path, since multiple snapshots will contain the same
-        // manifest
-        .collect(Collectors.toMap(ManifestFile::path, Function.identity(), (mf1, mf2) -> mf1))
-        .values()
-        .stream()
+        // distinct by manifest path, since multiple snapshots will contain the same manifest
+        // Use stateful filter to dedupe while streaming
+        .filter(mf -> seenPaths.add(mf.path()))
         .filter(mf -> TaskUtils.exists(mf.path(), fileIO))

Review Comment:
   Once we call `.collect(Collectors.toSet())`, the stream is fully 
materialized, which loses the benefit of lazy execution. Here we are trying 
to lower the memory footprint by relying on lazy execution.
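
To make the difference concrete, here is a minimal, self-contained sketch. It is not the Polaris code: `LazyDedupSketch`, `dedupeEagerly`, `dedupeLazily`, `elementsPulled`, and the sample paths are all hypothetical names, and plain strings stand in for `ManifestFile` objects. It counts how many source elements each variant pulls when the consumer only needs the first result.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only; strings stand in for manifest files.
final class LazyDedupSketch {

  // Eager variant: collect() is a terminal operation, so every source element
  // is pulled and held in the Set before any downstream stage can run.
  static Stream<String> dedupeEagerly(Stream<String> paths) {
    return paths.collect(Collectors.toSet()).stream();
  }

  // Lazy variant: a stateful filter that remembers only the paths already seen.
  // Elements flow downstream one at a time. Assumes a sequential stream;
  // HashSet is not thread-safe, so this is not suitable for parallel streams.
  static Stream<String> dedupeLazily(Stream<String> paths) {
    Set<String> seenPaths = new HashSet<>();
    return paths.filter(seenPaths::add);
  }

  // Counts how many source elements are pulled when only one result is consumed.
  static int elementsPulled(boolean lazy) {
    List<String> input = List.of("m1.avro", "m2.avro", "m1.avro", "m3.avro");
    AtomicInteger pulled = new AtomicInteger();
    Stream<String> source = input.stream().peek(p -> pulled.incrementAndGet());
    Stream<String> deduped = lazy ? dedupeLazily(source) : dedupeEagerly(source);
    deduped.limit(1).forEach(p -> {});
    return pulled.get();
  }

  public static void main(String[] args) {
    System.out.println("eager pulled: " + elementsPulled(false));
    System.out.println("lazy pulled:  " + elementsPulled(true));
  }
}
```

The lazy variant still keeps one string per distinct path in `seenPaths`, so its memory use grows with the number of distinct manifests, but it never holds the full collection of elements at once.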



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.