Re: [PR] [core] Support push-down valueStats prune when read manifests. [paimon]

via GitHub Tue, 02 Jun 2026 00:33:39 -0700


JingsongLi commented on code in PR #7392:
URL: https://github.com/apache/paimon/pull/7392#discussion_r3339441297



##########
paimon-core/src/main/java/org/apache/paimon/operation/AbstractFileStoreScan.java:
##########
@@ -493,20 +502,19 @@ private List<ManifestEntry> readManifest(
                                 manifest.fileSize(),
                                 manifestsReader.partitionFilter(),
                                 createBucketFilter(),
-                                createEntryRowFilter().and(additionalFilter),
+                                entryRowFilter.and(additionalFilter),
                                 entry ->
                                         (additionalTFilter == null || 
additionalTFilter.test(entry))
                                                 && (manifestEntryFilter == null
                                                         || 
manifestEntryFilter.test(entry))
                                                 && filterByStats(entry));
-        if (dropStats) {
-            List<ManifestEntry> copied = new ArrayList<>(entries.size());
-            for (ManifestEntry entry : entries) {
-                copied.add(dropStats(entry));
-            }
-            entries = copied;
+
+        List<T> result = new ArrayList<>(entries.size());
+        for (ManifestEntry entry : entries) {
+            result.add(converter.apply(dropStats ? dropStats(entry) : entry));

Review Comment:
   This still materializes the full `List<ManifestEntry>` from 
`ManifestFile.read()` before applying `dropStats`/`converter`, so peak memory 
during manifest read still includes all `valueStats`. That seems to miss the 
main goal of pushing the prune down into manifest reading: callers using 
`dropStats()` can still OOM before this loop gets a chance to copy the entries 
without stats.



##########
paimon-core/src/main/java/org/apache/paimon/utils/ObjectsCache.java:
##########
@@ -78,22 +98,38 @@ public List<V> read(K key, @Nullable Long fileSize, 
Filters<V> filters) throws I
                 fileSize = fileSizeFunction.apply(key);
             }
             if (fileSize <= cache.maxElementSize()) {
-                segments = createSegments(key, fileSize);
+                segments = createSegments(key, fileSize, loadFilter);

Review Comment:
   This makes the cache unsafe for any selective `loadFilter`: the cache key is 
still only `key`, so a cold read can store a filtered subset of the file, and a 
later read of the same key with a broader/different filter will hit that 
partial cache and silently miss rows. The new test currently asserts the first 
filtered read only; it should also read the same key again with 
`Filter.alwaysTrue()` and verify all rows are still returned. Could we avoid 
putting filtered segments into this cache, or include the load 
filter/projection in the cache identity?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [core] Support push-down valueStats prune when read manifests. [paimon]

Reply via email to