sadhen opened a new pull request #32320:
URL: https://github.com/apache/spark/pull/32320
### What changes were proposed in this pull request?
Infer the schema earlier and perform a type check.

This PR fixes SPARK-35211 when schema verification is turned on. When schema verification is turned off, the bug described in SPARK-35211 still exists; a follow-up PR will address that case.
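The idea of inferring the schema earlier and type-checking each row can be sketched in a simplified, hypothetical form (this is not the actual pyspark code; names and types below are illustrative):

```python
# Simplified, hypothetical sketch of "infer schema earlier and type check":
# the schema is inferred from the first row, then every subsequent row is
# verified against it before the DataFrame is created.
def infer_type(value):
    # Map a Python value to a simplified Spark type name.
    # bool is checked first because it is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    raise TypeError("unsupported type: %r" % type(value))

def verify_rows(rows):
    schema = [infer_type(v) for v in rows[0]]   # infer schema up front
    for row in rows[1:]:
        for expected, v in zip(schema, row):
            if infer_type(v) != expected:       # fail fast with a clear error
                raise TypeError(
                    "%s can not accept object %r in type %s"
                    % (expected, v, type(v)))
    return schema
```

For example, `verify_rows([(1, 2.0), (3, 4.0)])` returns `["long", "double"]`, while mixing a float and an int in the same column raises a `TypeError` instead of silently producing wrong values.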
### Why are the changes needed?
```python
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
from pyspark.testing.sqlutils import ExamplePoint
import pandas as pd
pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
df = spark.createDataFrame(pdf)
df.show()
```
The result is incorrect because of an incorrect type conversion.
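The root cause can be illustrated with a hypothetical miniature of a UDT like ExamplePoint (the class and method names below are illustrative, not pyspark's API): per the traceback further down, its internal SQL representation is an array of doubles, so constructing it with ints leaves ints in the serialized values, which a strict verifier must reject.

```python
# Hypothetical miniature of a UDT like ExamplePoint: its internal (SQL)
# representation is an array of doubles. No coercion happens during
# serialization, so ints survive into the internal values.
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def to_internal(self):
        # The UDT's sqlType expects [double, double].
        return [self.x, self.y]

def verify_double_array(values):
    # A strict element check: only Python floats are acceptable doubles.
    for v in values:
        if not isinstance(v, float):
            raise TypeError(
                "DoubleType can not accept object %r in type %s" % (v, type(v)))

verify_double_array(Point(1.0, 1.0).to_internal())   # passes
```

Here `Point(1, 1).to_internal()` yields `[1, 1]`, which the verifier rejects with a `TypeError` much like the one in the traceback.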
With this PR, a type check is performed:
```
(spark) ➜ spark git:(sadhen/SPARK-35211) ✗ bin/pyspark
Python 3.8.8 (default, Feb 24 2021, 13:46:16)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/04/24 17:42:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
      /_/

Using Python version 3.8.8 (default, Feb 24 2021 13:46:16)
Spark context Web UI available at http://172.30.0.12:4040
Spark context available as 'sc' (master = local[*], app id = local-1619257343692).
SparkSession available as 'spark'.
>>> spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
>>> from pyspark.testing.sqlutils import ExamplePoint
>>> import pandas as pd
>>> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
>>> df = spark.createDataFrame(pdf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/da/github/apache/spark/python/pyspark/sql/session.py", line 653, in createDataFrame
    return super(SparkSession, self).createDataFrame(
  File "/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py", line 340, in createDataFrame
    return self._create_dataframe(data, schema, samplingRatio, verifySchema)
  File "/Users/da/github/apache/spark/python/pyspark/sql/session.py", line 699, in _create_dataframe
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/Users/da/github/apache/spark/python/pyspark/sql/session.py", line 499, in _createFromLocal
    data = list(data)
  File "/Users/da/github/apache/spark/python/pyspark/sql/session.py", line 688, in prepare
    verify_func(obj)
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1409, in verify
    verify_value(obj)
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1390, in verify_struct
    verifier(v)
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1409, in verify
    verify_value(obj)
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1304, in verify_udf
    verifier(dataType.toInternal(obj))
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1409, in verify
    verify_value(obj)
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1354, in verify_array
    element_verifier(i)
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1409, in verify
    verify_value(obj)
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1403, in verify_default
    verify_acceptable_types(obj)
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1291, in verify_acceptable_types
    raise TypeError(new_msg("%s can not accept object %r in type %s"
TypeError: element in array field point: DoubleType can not accept object 1 in type <class 'int'>
```
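For context, the chain of verifiers visible in the traceback (struct → UDT → array → element) can be sketched in a simplified form. The factory functions below are hypothetical stand-ins, not pyspark's actual verifier builder:

```python
# Simplified, hypothetical sketch of how the verifiers in the traceback
# compose: each verifier wraps the next, so a struct field backed by a UDT
# whose sqlType is an array of doubles ends up checking every element.
def make_double_verifier():
    def verify(obj):
        if not isinstance(obj, float):
            raise TypeError(
                "DoubleType can not accept object %r in type %s"
                % (obj, type(obj)))
    return verify

def make_array_verifier(element_verifier):
    def verify(obj):
        for i in obj:                 # mirrors verify_array -> element_verifier(i)
            element_verifier(i)
    return verify

def make_udt_verifier(to_internal, inner_verifier):
    def verify(obj):
        # mirrors verify_udf -> verifier(dataType.toInternal(obj))
        inner_verifier(to_internal(obj))
    return verify

# Compose a verifier for a point-like value (hypothetical toInternal).
point_verifier = make_udt_verifier(
    lambda p: [p[0], p[1]],
    make_array_verifier(make_double_verifier()))

point_verifier((1.0, 2.0))   # passes; (1, 2) would raise TypeError
```

The composition explains why the traceback alternates between `verify` frames and type-specific frames: each layer unwraps one level of the schema before delegating.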
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.