Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/20567#discussion_r167711864
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1941,12 +1941,24 @@ def toPandas(self):
         timezone = None
         if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", "false").lower() == "true":
+            should_fall_back = False
             try:
-                from pyspark.sql.types import _check_dataframe_convert_date, \
-                    _check_dataframe_localize_timestamps
+                from pyspark.sql.types import to_arrow_schema
                 from pyspark.sql.utils import require_minimum_pyarrow_version
-                import pyarrow
                 require_minimum_pyarrow_version()
+                # Check if its schema is convertible in Arrow format.
+                to_arrow_schema(self.schema)
+            except Exception as e:
+                # Fallback to convert to Pandas DataFrame without arrow if raise some exception
+                should_fall_back = True
+                warnings.warn(
+                    "Arrow will not be used in toPandas: %s" % _exception_message(e))
+
+            if not should_fall_back:
+                import pyarrow
+                from pyspark.sql.types import _check_dataframe_convert_date, \
+                    _check_dataframe_localize_timestamps
+
                 tables = self._collectAsArrow()
--- End diff --
I see, we don't want to collect twice, so you run a schema conversion manually
to decide whether to fall back. I think there still might be some cases where
the Arrow path could fail, for example incompatible Arrow versions (say, a
future pyarrow release with Java still at 0.8), but this should cover the most
common cases, so it seems fine to me.
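
Just to make the behavior concrete, here is a rough standalone sketch (not
from this PR) of how the fallback looks from the user side. It assumes pyarrow
0.8.x is installed and that `MapType` is still one of the types
`to_arrow_schema` rejects, so the schema check raises before anything is
collected:

```python
import warnings

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, MapType, StringType, IntegerType

spark = SparkSession.builder.appName("arrow-fallback-sketch").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# MapType is currently not convertible to an Arrow schema, so the
# to_arrow_schema(self.schema) check in the try block should fail here.
schema = StructType([StructField("m", MapType(StringType(), IntegerType()))])
df = spark.createDataFrame([({"a": 1},)], schema)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Falls back to the plain (non-Arrow) conversion and only collects once.
    pdf = df.toPandas()

# Expect a single warning like "Arrow will not be used in toPandas: ..."
print([str(w.message) for w in caught])
print(pdf)
```

Since the check runs against `self.schema` on the driver, an unsupported
schema is caught before `_collectAsArrow()` ever runs, which is how the double
collect is avoided.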
---