spark git commit: [SPARK-23446][PYTHON] Explicitly check supported types in toPandas

lixiao Fri, 16 Feb 2018 09:41:45 -0800

Repository: spark
Updated Branches:
  refs/heads/branch-2.3 75bb19a01 -> ccb0a59d7



[SPARK-23446][PYTHON] Explicitly check supported types in toPandas

## What changes were proposed in this pull request?

This PR explicitly specifies and checks the types we supported in `toPandas`. 
This was a hole. For example, we haven't finished the binary type support in 
Python side yet but now it allows as below:

```python
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
df = spark.createDataFrame([[bytearray("a")]])
df.toPandas()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df.toPandas()
```

```
     _1
0  [97]
  _1
0  a
```

This should be disallowed. I think the same things also apply to nested 
timestamps too.

I also added some nicer message about `spark.sql.execution.arrow.enabled` in 
the error message.

## How was this patch tested?

Manually tested and tests added in `python/pyspark/sql/tests.py`.

Author: hyukjinkwon <[email protected]>

Closes #20625 from HyukjinKwon/pandas_convertion_supported_type.

(cherry picked from commit c5857e496ff0d170ed0339f14afc7d36b192da6d)
Signed-off-by: gatorsmile <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ccb0a59d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ccb0a59d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ccb0a59d

Branch: refs/heads/branch-2.3
Commit: ccb0a59d7383db451b86aee67423eb6e28f1f982
Parents: 75bb19a
Author: hyukjinkwon <[email protected]>
Authored: Fri Feb 16 09:41:17 2018 -0800
Committer: gatorsmile <[email protected]>
Committed: Fri Feb 16 09:41:32 2018 -0800

----------------------------------------------------------------------
 python/pyspark/sql/dataframe.py | 15 +++++++++------
 python/pyspark/sql/tests.py     |  9 ++++++++-
 2 files changed, 17 insertions(+), 7 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/ccb0a59d/python/pyspark/sql/dataframe.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index faee870..930d177 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -1943,10 +1943,11 @@ class DataFrame(object):
         if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", 
"false").lower() == "true":
             try:
                 from pyspark.sql.types import _check_dataframe_convert_date, \
-                    _check_dataframe_localize_timestamps
+                    _check_dataframe_localize_timestamps, to_arrow_schema
                 from pyspark.sql.utils import require_minimum_pyarrow_version
-                import pyarrow
                 require_minimum_pyarrow_version()
+                import pyarrow
+                to_arrow_schema(self.schema)
                 tables = self._collectAsArrow()
                 if tables:
                     table = pyarrow.concat_tables(tables)
@@ -1955,10 +1956,12 @@ class DataFrame(object):
                     return _check_dataframe_localize_timestamps(pdf, timezone)
                 else:
                     return pd.DataFrame.from_records([], columns=self.columns)
-            except ImportError as e:
-                msg = "note: pyarrow must be installed and available on 
calling Python process " \
-                      "if using spark.sql.execution.arrow.enabled=true"
-                raise ImportError("%s\n%s" % (_exception_message(e), msg))
+            except Exception as e:
+                msg = (
+                    "Note: toPandas attempted Arrow optimization because "
+                    "'spark.sql.execution.arrow.enabled' is set to true. 
Please set it to false "
+                    "to disable this.")
+                raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
         else:
             pdf = pd.DataFrame.from_records(self.collect(), 
columns=self.columns)
 

http://git-wip-us.apache.org/repos/asf/spark/blob/ccb0a59d/python/pyspark/sql/tests.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 904fa7a..da50b4d 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -3443,7 +3443,14 @@ class ArrowTests(ReusedSQLTestCase):
         schema = StructType([StructField("map", MapType(StringType(), 
IntegerType()), True)])
         df = self.spark.createDataFrame([(None,)], schema=schema)
         with QuietTest(self.sc):
-            with self.assertRaisesRegexp(Exception, 'Unsupported data type'):
+            with self.assertRaisesRegexp(Exception, 'Unsupported type'):
+                df.toPandas()
+
+        df = self.spark.createDataFrame([(None,)], schema="a binary")
+        with QuietTest(self.sc):
+            with self.assertRaisesRegexp(
+                    Exception,
+                    'Unsupported type.*\nNote: toPandas attempted Arrow 
optimization because'):
                 df.toPandas()
 
     def test_null_conversion(self):


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

spark git commit: [SPARK-23446][PYTHON] Explicitly check supported types in toPandas

Reply via email to