singhpk234 commented on code in PR #3415:
URL: https://github.com/apache/polaris/pull/3415#discussion_r2688735584


##########
runtime/service/src/main/java/org/apache/polaris/service/task/TableCleanupTaskHandler.java:
##########
@@ -150,6 +156,53 @@ public boolean handleTask(TaskEntity cleanupTask, CallContext callContext) {
     return false;
   }
 
+  private int processTasks(
+      Stream<TaskEntity> taskStream,
+      int batchSize,
+      PolarisMetaStoreManager metaStoreManager,
+      PolarisCallContext polarisCallContext,
+      IcebergTableLikeEntity tableEntity) {
+    int totalCount = 0;
+    Iterator<TaskEntity> iterator = taskStream.iterator();
+    List<TaskEntity> batch = new ArrayList<>(batchSize);
+
+    while (iterator.hasNext()) {
+      batch.add(iterator.next());
+      if (batch.size() >= batchSize) {
+        createAndRegisterTasks(batch, metaStoreManager, polarisCallContext, tableEntity);
+        totalCount += batch.size();
+        batch.clear();
+      }
+    }
+
+    // Create remaining tasks
+    if (!batch.isEmpty()) {
+      createAndRegisterTasks(batch, metaStoreManager, polarisCallContext, tableEntity);
+      totalCount += batch.size();
+    }
+

Review Comment:
   Can we explicitly call batch.clear() here, after the remaining tasks are created?
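
A minimal, self-contained sketch of the batching loop with the final clear made explicit. `BatchingSketch`, `processInBatches`, and the `Consumer` sink are hypothetical stand-ins for `processTasks` and `createAndRegisterTasks`, not Polaris APIs. The sketch assumes the sink may hold on to the list it receives, so it hands over a copy before clearing.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Stream;

// Hypothetical stand-in for the processTasks helper in the diff above.
class BatchingSketch {

  // Consumes the stream in batches of at most batchSize and returns the item count.
  static <T> int processInBatches(Stream<T> items, int batchSize, Consumer<List<T>> sink) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    int total = 0;
    Iterator<T> iterator = items.iterator();
    List<T> batch = new ArrayList<>(batchSize);

    while (iterator.hasNext()) {
      batch.add(iterator.next());
      if (batch.size() >= batchSize) {
        sink.accept(List.copyOf(batch)); // copy, because the buffer is reused
        total += batch.size();
        batch.clear();
      }
    }

    // Flush the remainder, then clear explicitly so the buffer holds no
    // references once the method returns.
    if (!batch.isEmpty()) {
      sink.accept(List.copyOf(batch));
      total += batch.size();
      batch.clear();
    }
    return total;
  }

  public static void main(String[] args) {
    List<List<Integer>> batches = new ArrayList<>();
    int count = processInBatches(Stream.of(1, 2, 3, 4, 5), 2, batches::add);
    System.out.println(count + " items in " + batches.size() + " batches");
  }
}
```

Since `batch` is a local variable, the explicit clear does not change behavior here; it mainly keeps the two flush sites symmetric.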



##########
runtime/service/src/main/java/org/apache/polaris/service/task/TableCleanupTaskHandler.java:
##########
@@ -160,13 +213,14 @@ private Stream<TaskEntity> getManifestTaskStream(
     // read the manifest list for each snapshot. dedupe the manifest files and schedule a
     // cleanupTask
     // for each manifest file and its data files to be deleted
+    // Use a Set to track seen paths for deduplication without materializing all ManifestFile
+    // objects
+    Set<String> seenPaths = new HashSet<>();
     return tableMetadata.snapshots().stream()
         .flatMap(sn -> sn.allManifests(fileIO).stream())
-        // distinct by manifest path, since multiple snapshots will contain the same
-        // manifest
-        .collect(Collectors.toMap(ManifestFile::path, Function.identity(), (mf1, mf2) -> mf1))
-        .values()
-        .stream()
+        // distinct by manifest path, since multiple snapshots will contain the same manifest
+        // Use stateful filter to dedupe while streaming
+        .filter(mf -> seenPaths.add(mf.path()))
         .filter(mf -> TaskUtils.exists(mf.path(), fileIO))

Review Comment:
   ```
     Set<String> uniquePaths = tableMetadata.snapshots().stream()
         .flatMap(sn -> sn.allManifests(fileIO).stream())
         .map(ManifestFile::path)
         .collect(Collectors.toSet());
   
     return uniquePaths.parallelStream()  // Parallel here!
         .filter(path -> TaskUtils.exists(path, fileIO))  // elements are already path Strings
         .map(path -> createManifestTask(...));
   ```
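
A runnable sketch of the "dedupe first, then process in parallel" idea, under stated assumptions: `DedupeSketch`, `Manifest`, and `existingUniquePaths` are hypothetical stand-ins (not Iceberg or Polaris types), and the `exists` predicate stands in for `TaskUtils.exists`. Unlike the snippet above, it keeps the whole object keyed by path, so a later mapping step would still have access to it. Note that this approach materializes all unique entries before the parallel stage starts.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class DedupeSketch {

  // Hypothetical stand-in for a manifest file; only the path matters here.
  static final class Manifest {
    private final String path;

    Manifest(String path) {
      this.path = path;
    }

    String path() {
      return path;
    }
  }

  // Dedupes manifests by path across snapshots, then runs the existence
  // check in parallel and returns the surviving paths in first-seen order.
  static List<String> existingUniquePaths(
      List<List<Manifest>> snapshots, Predicate<String> exists) {
    // Sequential collection step: first occurrence of each path wins.
    Map<String, Manifest> unique =
        snapshots.stream()
            .flatMap(List::stream)
            .collect(
                Collectors.toMap(
                    Manifest::path, Function.identity(), (a, b) -> a, LinkedHashMap::new));

    // Parallel step: the source is already deduplicated, so no shared
    // mutable state is touched inside the pipeline.
    return unique.values().parallelStream()
        .filter(m -> exists.test(m.path()))
        .map(Manifest::path)
        .collect(Collectors.toList());
  }
}
```

The trade-off against the streaming `seenPaths` filter in the diff is memory versus concurrency: the stateful filter avoids holding every manifest at once, but a plain `HashSet` inside a filter is only safe while the stream stays sequential.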


