viirya commented on a change in pull request #22807: [SPARK-25811][PySpark] Raise a proper error when unsafe cast is detected by PyArrow
URL: https://github.com/apache/spark/pull/22807#discussion_r249626037
##########
File path: docs/sql-migration-guide-upgrade.md
##########
@@ -41,6 +41,54 @@ displayTitle: Spark SQL Upgrading Guide

   - Since Spark 3.0, JSON datasource and JSON function `schema_of_json` infer TimestampType from string values if they match to the pattern defined by the JSON option `timestampFormat`. Set JSON option `inferTimestamp` to `false` to disable such type inferring.

+  - In PySpark, when Arrow optimization is enabled, if Arrow version is higher than 0.11.0, Arrow can perform safe type conversion when converting Pandas.Series to Arrow array during serialization. Arrow will raise errors when detecting unsafe type conversion like overflow. Setting `spark.sql.execution.pandas.arrowSafeTypeConversion` to true can enable it. The default setting is false. PySpark's behavior for Arrow versions is illustrated in the table below:
+  <table class="table">
+    <tr>
+      <th>
+        <b>PyArrow version</b>
+      </th>
+      <th>
+        <b>Integer Overflow</b>
+      </th>
+      <th>
+        <b>Floating Point Truncation</b>
+      </th>
+    </tr>
+    <tr>
+      <th>
+        <b>version < 0.11.0</b>
+      </th>
+      <th>
+        <b>Raise error</b>
+      </th>
+      <th>
+        <b>Silently allows</b>
+      </th>
+    </tr>
+    <tr>
+      <th>
+        <b>version > 0.11.0, arrowSafeTypeConversion=false</b>
+      </th>
+      <th>
+        <b>Silent overflow</b>
+      </th>
+      <th>
+        <b>Silently allows</b>
+      </th>
+    </tr>
+    <tr>
+      <th>
+        <b>version > 0.11.0, arrowSafeTypeConversion=true</b>

Review comment:
I think this config can be kept even if we set the minimum Arrow version to 0.12.0. If anything goes wrong, users can still disable the type check.

For now, the behavior is not consistent between integer overflow and floating-point truncation, so setting the config to either `true` or `false` causes a behavior change. It is `false` by default to be consistent with non-Arrow UDFs. If we change it to `true` in the future, wouldn't that be a behavior change again?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
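To make the "Integer Overflow" and "Floating Point Truncation" columns discussed above concrete, here is a minimal pure-Python sketch of what a safe versus unsafe cast to an 8-bit integer looks like. It is only a model of the semantics, not PyArrow's or PySpark's code: the function name `cast_to_int8`, the choice of int8 as the target type, and the use of `ValueError` are all assumptions made for illustration. Raising on float truncation in safe mode is also an assumption here, since the corresponding table row is cut off in the quoted diff.

```python
# Hypothetical model of safe vs. unsafe casting; not PyArrow's implementation.
INT8_MIN, INT8_MAX = -128, 127


def cast_to_int8(values, safe):
    """Cast numbers to int8-range integers.

    safe=True  -> raise ValueError on overflow or on loss of a fractional part.
    safe=False -> truncate toward zero and wrap around silently.
    """
    result = []
    for value in values:
        truncated = int(value)  # int() truncates toward zero
        if safe and truncated != value:
            raise ValueError(f"float value {value} would be truncated")
        if safe and not INT8_MIN <= truncated <= INT8_MAX:
            raise ValueError(f"integer value {truncated} overflows int8")
        # Two's-complement wrap-around into [-128, 127]
        result.append((truncated + 128) % 256 - 128)
    return result


# Unsafe mode: 128 wraps to -128 and 1.5 becomes 1, with no error.
print(cast_to_int8([128, 1.5], safe=False))  # prints [-128, 1]

# Safe mode: the same inputs are rejected.
try:
    cast_to_int8([128], safe=True)
except ValueError as exc:
    print("rejected:", exc)
```

Under this model, the `false` setting corresponds to the silent wrap and silent truncation, while `true` turns both into errors, which is why either default changes behavior for some users.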
