Yicong-Huang commented on code in PR #4100:
URL: https://github.com/apache/texera/pull/4100#discussion_r2578680660
##########
amber/src/main/python/core/models/schema/schema.py:
##########
@@ -81,26 +81,47 @@ def _from_raw_schema(self, raw_schema: Mapping[str, str])
-> None:
def _from_arrow_schema(self, arrow_schema: pa.Schema) -> None:
"""
Resets the Schema by converting a pyarrow.Schema.
+ Checks field metadata to detect BIG_OBJECT types.
:param arrow_schema: a pyarrow.Schema.
:return:
"""
self._name_type_mapping = OrderedDict()
for attr_name in arrow_schema.names:
- arrow_type = arrow_schema.field(attr_name).type # type: ignore
- attr_type = FROM_ARROW_MAPPING[arrow_type.id]
+ field = arrow_schema.field(attr_name)
+
+ # Check metadata for BIG_OBJECT type
+ # (can be stored by either Scala ArrowUtils or Python)
+ is_big_object = (
+ field.metadata and field.metadata.get(b"texera_type") ==
b"BIG_OBJECT"
+ )
+
+ attr_type = (
+ AttributeType.BIG_OBJECT
+ if is_big_object
+ else FROM_ARROW_MAPPING[field.type.id]
+ )
+
self.add(attr_name, attr_type)
def as_arrow_schema(self) -> pa.Schema:
"""
Creates a new pyarrow.Schema according to the current Schema.
+ Includes metadata for BIG_OBJECT types to preserve type information.
:return: pyarrow.Schema
"""
- return pa.schema(
- [
- pa.field(attr_name, TO_ARROW_MAPPING[attr_type])
- for attr_name, attr_type in self._name_type_mapping.items()
- ]
- )
+ fields = [
+ pa.field(
+ attr_name,
+ TO_ARROW_MAPPING[attr_type],
+ metadata=(
+ {b"texera_type": b"BIG_OBJECT"}
+ if attr_type == AttributeType.BIG_OBJECT
+ else None
+ ),
+ )
+ for attr_name, attr_type in self._name_type_mapping.items()
+ ]
+ return pa.schema(fields)
Review Comment:
Can you use the existing Arrow type `large_binary` instead of creating our own
new type?
https://arrow.apache.org/docs/python/generated/pyarrow.large_binary.html
##########
amber/src/main/python/core/models/schema/schema.py:
##########
@@ -81,26 +81,47 @@ def _from_raw_schema(self, raw_schema: Mapping[str, str])
-> None:
def _from_arrow_schema(self, arrow_schema: pa.Schema) -> None:
"""
Resets the Schema by converting a pyarrow.Schema.
+ Checks field metadata to detect BIG_OBJECT types.
:param arrow_schema: a pyarrow.Schema.
:return:
"""
self._name_type_mapping = OrderedDict()
for attr_name in arrow_schema.names:
- arrow_type = arrow_schema.field(attr_name).type # type: ignore
- attr_type = FROM_ARROW_MAPPING[arrow_type.id]
+ field = arrow_schema.field(attr_name)
+
+ # Check metadata for BIG_OBJECT type
+ # (can be stored by either Scala ArrowUtils or Python)
+ is_big_object = (
+ field.metadata and field.metadata.get(b"texera_type") ==
b"BIG_OBJECT"
+ )
+
+ attr_type = (
+ AttributeType.BIG_OBJECT
+ if is_big_object
+ else FROM_ARROW_MAPPING[field.type.id]
+ )
+
Review Comment:
The implementation is too intrusive in `schema.py`. Please move it out so that
the schema does not need to care whether a field should be handled specially.
##########
amber/src/main/python/core/models/schema/big_object.py:
##########
@@ -0,0 +1,98 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""
+BigObject represents a reference to a large object stored externally (e.g.,
S3).
+This is a schema type class used throughout the system for handling
+BIG_OBJECT attribute types.
+"""
+
+from typing import Optional
+from urllib.parse import urlparse
+
+
+class BigObject:
Review Comment:
I don't like this name in the codebase — `BigObject` sounds more like a temporary
project name. Please consider using `LargeBinary` as the type name to align with
other systems.
##########
amber/src/main/python/core/models/schema/attribute_type.py:
##########
@@ -92,4 +97,5 @@ class AttributeType(Enum):
bool: AttributeType.BOOL,
bytes: AttributeType.BINARY,
datetime.datetime: AttributeType.TIMESTAMP,
+ BigObject: AttributeType.BIG_OBJECT,
Review Comment:
Please consider changing the name as well. Also, use lowercase only to align with
the other types.
##########
amber/src/main/python/core/architecture/packaging/output_manager.py:
##########
@@ -259,10 +259,21 @@ def emit_state(
)
def tuple_to_frame(self, tuples: typing.List[Tuple]) -> DataFrame:
+ from core.models.schema.big_object import BigObject
+
return DataFrame(
frame=Table.from_pydict(
{
- name: [t[name] for t in tuples]
+ name: [
+ (
+ # Convert BigObject objects to URI strings
+ # for Arrow serialization
+ t[name].uri
+ if isinstance(t[name], BigObject)
+ else t[name]
+ )
Review Comment:
Very intrusive. it should be handled in schema, not `tuple_to_frame`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]