flyrain commented on code in PR #3415:
URL: https://github.com/apache/polaris/pull/3415#discussion_r2695390578


##########
runtime/service/src/main/java/org/apache/polaris/service/task/TableCleanupTaskHandler.java:
##########
@@ -160,13 +213,14 @@ private Stream<TaskEntity> getManifestTaskStream(
     // read the manifest list for each snapshot. dedupe the manifest files and schedule a
     // cleanupTask
     // for each manifest file and its data files to be deleted
+    // Use a Set to track seen paths for deduplication without materializing all ManifestFile
+    // objects
+    Set<String> seenPaths = new HashSet<>();
     return tableMetadata.snapshots().stream()
         .flatMap(sn -> sn.allManifests(fileIO).stream())
-        // distinct by manifest path, since multiple snapshots will contain the same
-        // manifest
-        .collect(Collectors.toMap(ManifestFile::path, Function.identity(), (mf1, mf2) -> mf1))
-        .values()
-        .stream()
+        // distinct by manifest path, since multiple snapshots will contain the same manifest
+        // Use stateful filter to dedupe while streaming
+        .filter(mf -> seenPaths.add(mf.path()))

Review Comment:
   Thanks for the detailed analysis. I agree that the only remaining unbounded 
structure here is the in-memory set used to dedupe manifest paths. I do not 
think this is a practical concern.
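   
   For reference, here is a minimal, self-contained sketch of the streaming 
dedupe pattern under discussion. It uses plain `String` paths instead of 
`ManifestFile` objects, and the class and method names are hypothetical, not 
taken from the PR. It relies on `Set.add` returning `false` for an element that 
is already present, and it assumes a sequential stream, since a stateful 
filter over a `HashSet` is not safe on a parallel one.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration; not the code in TableCleanupTaskHandler.
final class StreamingDedupSketch {

    // Keeps the first occurrence of each path and drops later repeats.
    // Only the path strings are retained in memory, not the full elements.
    static List<String> dedupeByPath(Stream<String> paths) {
        Set<String> seenPaths = new HashSet<>();
        return paths
            // Set.add returns true only the first time a path is seen.
            .filter(seenPaths::add)
            .collect(Collectors.toList());
    }
}
```

   The set grows with the number of distinct paths, which is the unbounded 
structure referred to above.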
   
   To put concrete numbers on it: in an extreme case with 1 million file 
paths and an estimated 50 to 100 bytes per path, including object and set 
overhead, the memory footprint would be roughly 50 MB to 100 MB, which is 
acceptable. That is already a very large table cleanup scenario. At that scale, 
the question becomes whether we even want the Polaris server itself to handle 
such a task synchronously in memory. A delegation service would be a better 
fit in that case.
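   
   As a back-of-the-envelope sketch of that estimate: the class and method 
names below are hypothetical, and the 50 to 100 bytes per entry is the rough 
assumption stated above, not a measurement of actual JVM heap usage.

```java
// Hypothetical helper for a rough sizing estimate; not part of Polaris.
final class DedupSetMemoryEstimate {

    // Total bytes for pathCount entries at an assumed per-entry cost.
    static long estimateBytes(long pathCount, long assumedBytesPerPath) {
        return Math.multiplyExact(pathCount, assumedBytesPerPath);
    }

    // Converts bytes to mebibytes (1 MiB = 1024 * 1024 bytes).
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long paths = 1_000_000L;
        long low = estimateBytes(paths, 50);   // 50,000,000 bytes
        long high = estimateBytes(paths, 100); // 100,000,000 bytes
        System.out.printf("low:  %.1f MiB%n", toMiB(low));
        System.out.printf("high: %.1f MiB%n", toMiB(high));
    }
}
```

   The result scales linearly with the number of distinct manifest paths, so 
a table with ten times as many paths would need roughly ten times the memory.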


