JingsongLi commented on code in PR #8021:
URL: https://github.com/apache/paimon/pull/8021#discussion_r3347318601


##########
paimon-python/pypaimon/read/split_read.py:
##########
@@ -823,12 +830,27 @@ def create_reader(self) -> RecordReader:
         else:
             reader = merge_reader
 
-        if (not CoreOptions.blob_as_descriptor(self.table.options)
-                and CoreOptions.blob_descriptor_fields(self.table.options)):
-            reader = BlobDescriptorConvertReader(reader, self.table)
-
         return reader
 
+    def _create_prescan_reader(self, field_names):
+        """Create a prescan reader by constructing a new DataEvolutionSplitRead
+        instance that only projects the specified field names."""
+        from pypaimon.read.reader.iface.record_batch_reader import EmptyRecordBatchReader
+
+        prescan_fields = [f for f in self.read_fields if f.name in field_names]
+        if not prescan_fields:
+            return EmptyRecordBatchReader()
+
+        prescan_read = DataEvolutionSplitRead(

Review Comment:
   This prescan reader only projects the blob-view columns, so 
`SplitRead.__init__` drops any predicate whose fields are outside that reduced 
projection, and it also does not receive the original read limit. As a result, 
`BlobInlineConvertReader._prescan_view_structs()` can preload and resolve 
`BlobViewStruct`s from rows that the final read would filter out or skip. I 
reproduced this with two target rows where `id = 1` is valid and `id = 2` 
points at a missing upstream table: reading with `with_filter(id == 1)` or 
`with_limit(1)` still fails during prescan on row 2. The prescan should 
preserve the effective filter/limit semantics, or otherwise only scan rows that 
can be returned.
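
   A minimal sketch of the failure mode and one possible fix, not the actual pypaimon code. `Row`, `UPSTREAM`, `resolve`, and both `prescan_*` functions are hypothetical stand-ins, and the sketch assumes the prescan can still evaluate the predicate (in the real reader that would mean also projecting the predicate's columns):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Row:
    id: int
    view_target: str  # hypothetical stand-in for a blob-view reference


# Hypothetical upstream tables; "t2" is deliberately missing.
UPSTREAM: Dict[str, bytes] = {"t1": b"payload"}

Predicate = Optional[Callable[[Row], bool]]


def resolve(target: str) -> bytes:
    """Resolve a view reference, failing if the upstream table is missing."""
    if target not in UPSTREAM:
        raise LookupError(f"missing upstream table: {target}")
    return UPSTREAM[target]


def effective_rows(rows: List[Row], predicate: Predicate = None,
                   limit: Optional[int] = None) -> List[Row]:
    """Return only the rows the final read could return."""
    selected: List[Row] = []
    for row in rows:
        if predicate is not None and not predicate(row):
            continue
        if limit is not None and len(selected) >= limit:
            break
        selected.append(row)
    return selected


def prescan_ignoring_semantics(rows: List[Row], predicate: Predicate = None,
                               limit: Optional[int] = None) -> Dict[int, bytes]:
    """Mirrors the reported behavior: filter and limit are not applied."""
    return {row.id: resolve(row.view_target) for row in rows}


def prescan_preserving_semantics(rows: List[Row], predicate: Predicate = None,
                                 limit: Optional[int] = None) -> Dict[int, bytes]:
    """Resolves references only for rows that survive the filter and limit."""
    return {row.id: resolve(row.view_target)
            for row in effective_rows(rows, predicate, limit)}
```

   With rows `id = 1` (valid) and `id = 2` (missing upstream), the first prescan raises for any filter or limit, while the second succeeds for `id == 1` or a limit of 1.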



##########
paimon-python/pypaimon/schema/schema.py:
##########
@@ -62,39 +62,8 @@ def from_pyarrow_schema(pa_schema: pa.Schema, partition_keys: Optional[List[str]
                 if field.name in pk_set:
                     field.type.nullable = False
 
-        # Check if Blob type exists in the schema
-        blob_names = [
-            field.name for field in fields
-            if 'blob' in str(field.type).lower()
-        ]
-
-        if blob_names:
-            if options is None:
-                options = {}
-
-            if len(fields) <= len(blob_names):
-                raise ValueError(
-                    "Table with BLOB type column must have other normal columns."
-                )
-
-            required_options = {
-                CoreOptions.ROW_TRACKING_ENABLED.key(): 'true',
-                CoreOptions.DATA_EVOLUTION_ENABLED.key(): 'true'
-            }
-
-            missing_options = []
-            for key, expected_value in required_options.items():
-                if key not in options or options[key] != expected_value:
-                    missing_options.append(f"{key}='{expected_value}'")
-
-            if missing_options:
-                raise ValueError(
-                    f"Schema contains Blob type but is missing required options: {', '.join(missing_options)}. "
-                    f"Please add these options to the schema."
-                )
-
-            if primary_keys is not None:
-                raise ValueError("Blob type is not supported with primary key.")
+        # Validate Blob type fields in the schema
+        Schema._validate_blob_fields(fields, options, primary_keys)

Review Comment:
   This validation only runs when callers create schemas through 
`Schema.from_pyarrow_schema(...)`. Callers can still construct 
`Schema(fields=..., options=...)` directly and `SchemaManager.create_table` 
will commit invalid `blob-view-field` / `blob-descriptor-field` settings, for 
example a `blob-view-field` pointing at a STRING column. Since direct 
`Schema(...)` construction is a supported/tested path, this validation should 
also run in `SchemaManager.create_table` after directives are applied, or 
otherwise be centralized before the schema is committed.
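
   A rough sketch of what centralizing the check could look like, under stated assumptions. `Field`, `Schema`, `validate_blob_options`, and `create_table` here are simplified hypothetical stand-ins rather than the pypaimon classes, and the sketch assumes the two options hold comma-separated field names:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Field:
    name: str
    type: str  # simplified: the type is just its name, e.g. "BLOB"


@dataclass
class Schema:
    fields: List[Field]
    options: Dict[str, str] = field(default_factory=dict)


# Option keys taken from the review comment; the value format is an assumption.
BLOB_FIELD_OPTIONS = ("blob-view-field", "blob-descriptor-field")


def validate_blob_options(schema: Schema) -> None:
    """Reject blob options that name unknown or non-BLOB columns."""
    types = {f.name: f.type.upper() for f in schema.fields}
    for key in BLOB_FIELD_OPTIONS:
        raw = schema.options.get(key, "")
        for name in (part.strip() for part in raw.split(",")):
            if not name:
                continue
            if name not in types:
                raise ValueError(f"{key} refers to unknown field '{name}'")
            if types[name] != "BLOB":
                raise ValueError(
                    f"{key} refers to '{name}' of type {types[name]}, expected BLOB"
                )


def create_table(schema: Schema) -> Schema:
    """Hypothetical commit path: validate no matter how the schema was built."""
    validate_blob_options(schema)
    return schema  # a real implementation would persist the schema here
```

   Running the check in the commit path covers both `Schema.from_pyarrow_schema(...)` and direct `Schema(...)` construction, so a `blob-view-field` pointing at a STRING column is rejected before anything is written.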


