JingsongLi commented on code in PR #7392:
URL: https://github.com/apache/paimon/pull/7392#discussion_r3339441297
##########
paimon-core/src/main/java/org/apache/paimon/operation/AbstractFileStoreScan.java:
##########
@@ -493,20 +502,19 @@ private List<ManifestEntry> readManifest(
manifest.fileSize(),
manifestsReader.partitionFilter(),
createBucketFilter(),
- createEntryRowFilter().and(additionalFilter),
+ entryRowFilter.and(additionalFilter),
entry ->
(additionalTFilter == null ||
additionalTFilter.test(entry))
&& (manifestEntryFilter == null
||
manifestEntryFilter.test(entry))
&& filterByStats(entry));
- if (dropStats) {
- List<ManifestEntry> copied = new ArrayList<>(entries.size());
- for (ManifestEntry entry : entries) {
- copied.add(dropStats(entry));
- }
- entries = copied;
+
+ List<T> result = new ArrayList<>(entries.size());
+ for (ManifestEntry entry : entries) {
+ result.add(converter.apply(dropStats ? dropStats(entry) : entry));
Review Comment:
This still materializes the full `List<ManifestEntry>` from
`ManifestFile.read()` before applying `dropStats`/`converter`, so peak memory
during manifest read still includes all `valueStats`. That seems to miss the
main goal of pushing the prune down into manifest reading: callers using
`dropStats()` can still OOM before this loop gets a chance to copy the entries
without stats.
##########
paimon-core/src/main/java/org/apache/paimon/utils/ObjectsCache.java:
##########
@@ -78,22 +98,38 @@ public List<V> read(K key, @Nullable Long fileSize,
Filters<V> filters) throws I
fileSize = fileSizeFunction.apply(key);
}
if (fileSize <= cache.maxElementSize()) {
- segments = createSegments(key, fileSize);
+ segments = createSegments(key, fileSize, loadFilter);
Review Comment:
This makes the cache unsafe for any selective `loadFilter`: the cache key is
still only `key`, so a cold read can store a filtered subset of the file, and a
later read of the same key with a broader/different filter will hit that
partial cache and silently miss rows. The new test currently asserts the first
filtered read only; it should also read the same key again with
`Filter.alwaysTrue()` and verify all rows are still returned. Could we avoid
putting filtered segments into this cache, or include the load
filter/projection in the cache identity?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]