Yicong-Huang commented on code in PR #52467:
URL: https://github.com/apache/spark/pull/52467#discussion_r2408849437
##########
python/pyspark/sql/conversion.py:
##########
@@ -735,7 +744,11 @@ def convert(
@staticmethod # type: ignore[misc]
def convert(
-        table: "pa.Table", schema: StructType, *, return_as_tuples: bool = False
+ table: "pa.Table",
+ schema: StructType,
+ *,
+ return_as_tuples: bool = False,
+ binary_as_bytes: bool = True,
Review Comment:
maybe this should be controlled by a flag?
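If the intent is to gate `binary_as_bytes` behind a SQL conf (as a later hunk in this PR does for the Connect path), a minimal sketch could look like the following. The helper name and the lambda stand-ins for `session.conf.get` are illustrative; only the conf key comes from this PR.

```python
# Hypothetical sketch: derive the boolean keyword argument for the Arrow
# conversion from a string-valued SQL conf.
def binary_as_bytes_enabled(conf_get) -> bool:
    # conf_get is a stand-in for session.conf.get(key, default),
    # which returns Optional[str] in PySpark.
    value = conf_get("spark.sql.execution.pyspark.binaryAsBytes", "true")
    return value is not None and value.lower() == "true"

# Stand-in confs for demonstration:
print(binary_as_bytes_enabled(lambda key, default: "true"))   # True
print(binary_as_bytes_enabled(lambda key, default: "FALSE"))  # False
print(binary_as_bytes_enabled(lambda key, default: default))  # True (unset -> default)
```

The `value is not None` guard matters because `conf.get` is typed as returning `Optional[str]`, so `.lower()` cannot be called unconditionally.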
##########
python/pyspark/sql/tests/connect/arrow/test_parity_arrow_python_udf.py:
##########
@@ -46,6 +48,14 @@ def tearDownClass(cls):
finally:
super().tearDownClass()
+ @unittest.skip("Duplicate test as it is already tested in ArrowPythonUDFLegacyTests.")
+ def test_udf_binary_type(self):
+ super().test_udf_binary_type()
+
+ @unittest.skip("Duplicate test as it is already tested in ArrowPythonUDFLegacyTests.")
+ def test_udf_binary_type_in_nested_structures(self):
+ super().test_udf_binary_type_in_nested_structures()
+
Review Comment:
why do we add these tests only to skip them?
##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -1823,7 +1823,14 @@ def collect(self) -> List[Row]:
assert schema is not None and isinstance(schema, StructType)
- return ArrowTableToRowsConversion.convert(table, schema)
+ return ArrowTableToRowsConversion.convert(
+ table, schema, binary_as_bytes=self._get_binary_as_bytes()
+ )
+
+ def _get_binary_as_bytes(self) -> bool:
+ """Get the binary_as_bytes configuration value from Spark session."""
+ conf_value = self._session.conf.get("spark.sql.execution.pyspark.binaryAsBytes", "true")
+ return (conf_value or "true").lower() == "true"
Review Comment:
for an Optional, it can only be `None` or a string, right? maybe we can use
```
conf_value is not None and conf_value.lower() == "true"
```
to avoid the `or "true"` fallback?
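Worth noting: the two expressions are not equivalent. The version in the PR falls back to enabled when the conf comes back as `None` (or even an empty string), while the suggested version treats `None` as disabled. A runnable comparison in plain Python, with no Spark session needed:

```python
def default_on(conf_value):
    # Version in the PR: None (and "") fall back to the "true" default.
    return (conf_value or "true").lower() == "true"

def none_is_false(conf_value):
    # Suggested version: None is treated as disabled.
    return conf_value is not None and conf_value.lower() == "true"

for value in (None, "", "true", "True", "false"):
    print(repr(value), default_on(value), none_is_false(value))
```

The two disagree for `None` and `""`, so the choice changes behavior when the conf is unset and no default reaches the getter.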
##########
python/docs/source/tutorial/sql/type_conversions.rst:
##########
@@ -67,6 +67,9 @@ are listed below:
* - spark.sql.execution.pandas.inferPandasDictAsMap
- When enabled, Pandas dictionaries are inferred as MapType. Otherwise,
they are inferred as StructType.
- False
+ * - spark.sql.execution.pyspark.binaryAsBytes
+ - Introduced in Spark 4.1.0. When enabled, BinaryType is mapped consistently to Python bytes; when disabled, matches the PySpark default behavior before 4.1.0.
Review Comment:
maybe we can briefly explain what the default behavior was before 4.1.0?
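For anyone drafting that explanation: the distinction the flag name hints at is `bytes` versus `bytearray` on the Python side. Which pre-4.1.0 code paths returned `bytearray` should be verified against the PR itself; the snippet below only illustrates, in plain Python, why the two types are not interchangeable for callers.

```python
# Pure-Python illustration of the bytes/bytearray distinction; which Spark
# code paths returned which type before 4.1.0 must be checked against the PR.
value_bytes = b"\x00\x01"
value_bytearray = bytearray(value_bytes)

print(value_bytes == value_bytearray)              # True: equal by content
print(type(value_bytes) is type(value_bytearray))  # False: different types
print(isinstance(value_bytearray, bytes))          # False: bytes-only checks break
value_bytearray[0] = 255                           # bytearray is mutable; bytes is not
```

So code that hashes, pattern-matches on type, or relies on immutability behaves differently depending on which type a BinaryType column surfaces as, which is why documenting the old default explicitly would help users.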
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]