JingsongLi commented on code in PR #8021:
URL: https://github.com/apache/paimon/pull/8021#discussion_r3348135024
##########
paimon-python/pypaimon/read/split_read.py:
##########
@@ -823,12 +830,27 @@ def create_reader(self) -> RecordReader:
else:
reader = merge_reader
- if (not CoreOptions.blob_as_descriptor(self.table.options)
- and CoreOptions.blob_descriptor_fields(self.table.options)):
- reader = BlobDescriptorConvertReader(reader, self.table)
-
return reader
+ def _create_prescan_reader(self, field_names):
+ """Create a prescan reader by constructing a new DataEvolutionSplitRead
+ instance that only projects the specified field names."""
+ from pypaimon.read.reader.iface.record_batch_reader import EmptyRecordBatchReader
+
+ prescan_fields = [f for f in self.read_fields if f.name in field_names]
+ if not prescan_fields:
+ return EmptyRecordBatchReader()
+
+ prescan_read = DataEvolutionSplitRead(
Review Comment:
Thanks for checking. The missing upstream table was only a minimal way to
make the semantic issue visible. The correctness problem is that the prescan
observes rows that the final query would filter out or skip. A filtered-out row
may contain a stale BlobViewStruct, a reference that fails with a permission or
network error, or simply many expensive references, and none of that should make
`WHERE id = 1` fail or trigger extra upstream reads.
Also, a limit-pushdown change in another PR would not address the predicate
case here, because this prescan rebuilds the read type with only blob-view
fields, so predicates on non-view columns are trimmed before the prescan reader
is built. I think this PR needs to preserve the effective filter/limit in the
prescan path, matching the Java path, and add regression tests for both cases.
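To make the failure mode concrete, here is a minimal, self-contained sketch. It is
not pypaimon code: `prescan`, `resolve_view`, the row dictionaries, and the
`upstream` mapping are all hypothetical stand-ins. It only models the point above,
that a prescan which ignores the query's predicate and limit touches references
the final read would never touch.

```python
# Hypothetical model of the prescan semantics; none of these names are pypaimon APIs.
from typing import Callable, Dict, Iterable, Optional

Row = Dict[str, object]


def resolve_view(ref: str, upstream: Dict[str, bytes]) -> bytes:
    """Stand-in for dereferencing a blob view; raises on a stale reference."""
    if ref not in upstream:
        raise LookupError(f"stale blob view reference: {ref}")
    return upstream[ref]


def prescan(
    rows: Iterable[Row],
    view_field: str,
    upstream: Dict[str, bytes],
    predicate: Optional[Callable[[Row], bool]] = None,
    limit: Optional[int] = None,
) -> Dict[str, bytes]:
    """Resolve blob-view references, honoring the query's filter and limit."""
    resolved: Dict[str, bytes] = {}
    seen = 0
    for row in rows:
        # Skip rows the final query would filter out.
        if predicate is not None and not predicate(row):
            continue
        # Stop once the final query would have stopped reading.
        if limit is not None and seen >= limit:
            break
        seen += 1
        ref = str(row[view_field])
        resolved[ref] = resolve_view(ref, upstream)
    return resolved


# Row 2 carries a stale reference; row 1 is the only row matching `id = 1`.
rows = [{"id": 1, "blob": "t/a"}, {"id": 2, "blob": "t/missing"}]
upstream = {"t/a": b"payload"}
```

In this model, `prescan(rows, "blob", upstream)` raises `LookupError` because of
row 2, while passing `predicate=lambda r: r["id"] == 1` or `limit=1` succeeds.
Note that the predicate needs the `id` column, so a prescan that projects only the
blob-view fields cannot evaluate it; the regression tests I am asking for would
cover exactly these two cases.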