flyrain commented on code in PR #3415:
URL: https://github.com/apache/polaris/pull/3415#discussion_r2689031469
##########
runtime/service/src/main/java/org/apache/polaris/service/task/TableCleanupTaskHandler.java:
##########
@@ -160,13 +213,14 @@ private Stream<TaskEntity> getManifestTaskStream(
// read the manifest list for each snapshot. dedupe the manifest files and schedule a
// cleanupTask
// for each manifest file and its data files to be deleted
+ // Use a Set to track seen paths for deduplication without materializing all ManifestFile
+ // objects
+ Set<String> seenPaths = new HashSet<>();
return tableMetadata.snapshots().stream()
.flatMap(sn -> sn.allManifests(fileIO).stream())
- // distinct by manifest path, since multiple snapshots will contain the same
- // manifest
- .collect(Collectors.toMap(ManifestFile::path, Function.identity(), (mf1, mf2) -> mf1))
- .values()
- .stream()
+ // distinct by manifest path, since multiple snapshots will contain the same manifest
+ // Use stateful filter to dedupe while streaming
+ .filter(mf -> seenPaths.add(mf.path()))
.filter(mf -> TaskUtils.exists(mf.path(), fileIO))
Review Comment:
Once we call `.collect(Collectors.toSet())`, the stream is fully
materialized, which loses the benefit of lazy execution. Here we are trying
to lower the memory footprint by relying on lazy execution.
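
To illustrate the difference, here is a minimal, self-contained sketch. It is
not the Polaris code: the class `LazyDedupeSketch`, the method `trace`, and the
sample paths are hypothetical, and plain strings stand in for `ManifestFile`
objects. It records when each element is read from the source and when it
reaches the terminal operation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only; strings stand in for ManifestFile objects.
class LazyDedupeSketch {

  /** Returns the ordered list of "read X" / "emit X" events for one pipeline run. */
  static List<String> trace(boolean lazy) {
    List<String> events = new ArrayList<>();
    // peek() logs the moment an element is pulled from the source.
    Stream<String> source =
        Stream.of("a", "b", "a", "c").peek(p -> events.add("read " + p));

    Stream<String> deduped;
    if (lazy) {
      // Stateful filter: only the seen keys are retained, and each element
      // flows downstream as soon as it is read.
      Set<String> seen = new HashSet<>();
      deduped = source.filter(seen::add);
    } else {
      // collect() is a terminal operation: the whole source is drained into
      // a map before any downstream stage sees a single element.
      deduped =
          source
              .collect(Collectors.toMap(Function.identity(), Function.identity(), (x, y) -> x))
              .values()
              .stream();
    }

    deduped.forEach(p -> events.add("emit " + p));
    return events;
  }

  public static void main(String[] args) {
    System.out.println("lazy:  " + trace(true));
    System.out.println("eager: " + trace(false));
  }
}
```

In the lazy variant the reads and emits interleave, while in the eager variant
all four reads happen before the first emit. Two caveats: the stateful filter
is only safe on a sequential stream, and the set of seen paths still grows with
the number of distinct paths (it just holds path strings rather than the
objects themselves).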