Github user BryanCutler commented on a diff in the pull request:
    --- Diff: python/pyspark/sql/ ---
    @@ -1941,12 +1941,24 @@ def toPandas(self):
                 timezone = None
             if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", "false").lower() == "true":
    +            should_fall_back = False
    +            try:
    -                from pyspark.sql.types import _check_dataframe_convert_date, \
    -                    _check_dataframe_localize_timestamps
    +                from pyspark.sql.types import to_arrow_schema
                     from pyspark.sql.utils import require_minimum_pyarrow_version
    -                import pyarrow
    +                # Check if the schema is convertible to Arrow format.
    +                to_arrow_schema(self.schema)
    +            except Exception as e:
    +                # Fall back to conversion without Arrow if an exception is raised.
    +                should_fall_back = True
    +                warnings.warn(
    +                    "Arrow will not be used in toPandas: %s" % str(e))
    +            if not should_fall_back:
    +                import pyarrow
    +                from pyspark.sql.types import _check_dataframe_convert_date, \
    +                    _check_dataframe_localize_timestamps
                     tables = self._collectAsArrow()
    --- End diff ---
    I see, we don't want to collect twice, so you manually run a schema 
conversion up front and fall back if it fails. I think there still might be 
some cases where the Arrow path could fail, like incompatible Arrow versions 
(e.g. using a possible future version of pyarrow with Java still at 0.8), but 
this should cover the most common cases, so it seems fine to me.
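
For anyone skimming this thread, here is a minimal standalone sketch of the 
pattern in the diff above: run a cheap schema check first so the (possibly 
expensive) collect only ever happens once, and fall back with a warning if 
the check fails. `collect_with_fallback` and its callbacks are hypothetical 
stand-ins, not Spark APIs; the sketch assumes a modern pyarrow is installed 
and uses pa.schema() as a stand-in for to_arrow_schema(self.schema).

    import warnings

    import pyarrow as pa

    def collect_with_fallback(fields, arrow_collect, plain_collect):
        # Cheap schema check first: building a pyarrow schema raises for
        # unsupported or unknown type names, before any data is collected.
        try:
            pa.schema(fields)
        except Exception as e:
            # Same shape as the diff: warn, then take the non-Arrow path.
            warnings.warn("Arrow will not be used: %s" % str(e))
            return plain_collect()
        return arrow_collect()

    # A convertible schema takes the Arrow path; an unknown type falls back.
    print(collect_with_fallback([("x", "int64")],
                                lambda: "arrow", lambda: "plain"))   # arrow
    print(collect_with_fallback([("x", "not-a-type")],
                                lambda: "arrow", lambda: "plain"))   # plain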

