xianzhe-databricks commented on code in PR #52467:
URL: https://github.com/apache/spark/pull/52467#discussion_r2435248776


##########
python/docs/source/migration_guide/pyspark_upgrade.rst:
##########
@@ -29,6 +29,17 @@ Upgrading from PySpark 4.0 to 4.1
 * In Spark 4.1, Arrow-optimized Python UDF supports UDT input / output instead of falling back to the regular UDF. To restore the legacy behavior, set ``spark.sql.execution.pythonUDF.arrow.legacy.fallbackOnUDT`` to ``true``.
 * In Spark 4.1, unnecessary conversion to pandas instances is removed when ``spark.sql.execution.pythonUDF.arrow.enabled`` is enabled. As a result, the type coercion changes when the produced output has a schema different from the specified schema. To restore the previous behavior, enable ``spark.sql.legacy.execution.pythonUDF.pandas.conversion.enabled``.
 * In Spark 4.1, unnecessary conversion to pandas instances is removed when ``spark.sql.execution.pythonUDTF.arrow.enabled`` is enabled. As a result, the type coercion changes when the produced output has a schema different from the specified schema. To restore the previous behavior, enable ``spark.sql.legacy.execution.pythonUDTF.pandas.conversion.enabled``.
+* In Spark 4.1, the data type ``BinaryType`` is mapped to Python ``bytes`` consistently in PySpark.
+  To restore the previous behavior, set ``spark.sql.execution.pyspark.binaryAsBytes`` to ``false``. The behavior before Spark 4.1.0 is illustrated in the following table:
+
+    ================================================================================  ==============================
+    Case                                                                              Python type for ``BinaryType``
+    ================================================================================  ==============================
+    Regular UDF and UDTF without Arrow optimization                                   ``bytearray``
+    DataFrame APIs (both Spark Classic and Spark Connect)                             ``bytearray``
+    Data sources                                                                      ``bytearray``
+    Arrow-optimized UDF and UDTF with unnecessary conversion to pandas instances      ``bytes``
+    ================================================================================  ==============================

Review Comment:
   @ueshin @allisonwang-db migration guide is added!
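
   For context, a minimal sketch of the new default behavior described in the added entry (illustrative only: it assumes a Spark 4.1 session with default settings, and the DataFrame, column, and ``binary_kind`` UDF are made-up examples, not part of this PR):

   ```python
   from pyspark.sql import SparkSession
   from pyspark.sql.functions import udf
   from pyspark.sql.types import StringType

   spark = SparkSession.builder.getOrCreate()
   df = spark.createDataFrame([(b"\x00\x01",)], "b binary")

   # DataFrame APIs: a BinaryType value comes back as Python bytes in 4.1
   # (it was bytearray before 4.1, per the table above).
   value = df.collect()[0]["b"]
   print(type(value))  # <class 'bytes'>

   # Regular (non-Arrow) UDF: the BinaryType input is bytes in 4.1,
   # bytearray before 4.1.
   @udf(returnType=StringType(), useArrow=False)
   def binary_kind(v):
       return type(v).__name__

   df.select(binary_kind("b")).show()

   # To restore the pre-4.1 behavior summarized in the table:
   # spark.conf.set("spark.sql.execution.pyspark.binaryAsBytes", "false")
   ```

   The Arrow-optimized UDF/UDTF path that went through the pandas conversion already produced bytes before 4.1, so only the other rows of the table change behavior.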



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

