BryanCutler commented on a change in pull request #22807: [SPARK-25811][PySpark] Raise a proper error when unsafe cast is detected by PyArrow
URL: https://github.com/apache/spark/pull/22807#discussion_r246859417
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
 ##########
 @@ -1331,6 +1331,16 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val PANDAS_ARROW_SAFE_TYPE_CONVERSION =
+    buildConf("spark.sql.execution.pandas.arrowSafeTypeConversion")
+      .internal()
+      .doc("When true, enabling Arrow do safe type conversion check when converting" +
+        "Pandas.Series to Arrow Array during serialization. Arrow will raise errors " +
+        "when detecting unsafe type conversion. When false, disabling Arrow's type " +
+        "check and do type conversions anyway.")
+      .booleanConf
+      .createWithDefault(true)
 
 Review comment:
   I think the big issue with this is when NULL values are introduced into an
integer column. Pandas automatically converts the column to floating point to
represent the NULLs; when Arrow then casts it back to integer, it raises an
error due to truncation. I don't think Arrow checks the actual values, but
maybe it should? For example, with safe=True:
   
   ```python
   >>> import pandas as pd
   >>> import pyarrow as pa
   >>> pa.Array.from_pandas(pd.Series([1, None]), type=pa.int32(), safe=True)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "pyarrow/array.pxi", line 474, in pyarrow.lib.Array.from_pandas
     File "pyarrow/array.pxi", line 169, in pyarrow.lib.array
     File "pyarrow/array.pxi", line 69, in pyarrow.lib._ndarray_to_array
     File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: Floating point value truncated
   ```
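
   To make "checking the actual values" concrete, here is a rough sketch in
plain Python of what a value-aware check could look like. The helper name
`safe_float_to_int` is hypothetical and this is not how Arrow implements its
cast. It only illustrates treating NaN as NULL and rejecting a value only when
it really has a fractional part:

   ```python
   import math
   from typing import List, Optional, Sequence


   def safe_float_to_int(values: Sequence[float]) -> List[Optional[int]]:
       """Hypothetical value-aware cast: NaN -> None, integral floats -> int."""
       result: List[Optional[int]] = []
       for v in values:
           if math.isnan(v):
               # NaN stands in for NULL after the pandas int -> float promotion.
               result.append(None)
           elif not float(v).is_integer():
               # Only a value with a real fractional part (or inf) is unsafe.
               raise ValueError(f"Floating point value truncated: {v!r}")
           else:
               result.append(int(v))
       return result


   print(safe_float_to_int([1.0, float("nan")]))  # [1, None]
   ```

   With a check like this, `[1.0, nan]` would convert cleanly, while something
like `[1.5]` would still raise.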
