sadhen opened a new pull request #32320:
URL: https://github.com/apache/spark/pull/32320
### What changes were proposed in this pull request?
Infer the schema earlier and perform a type check.

This PR fixes SPARK-35211 when schema verification is turned on. When schema verification is turned off, the bug described in SPARK-35211 still exists; a follow-up PR will address that case.
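The idea of inferring the schema earlier and type-checking each row can be sketched in a simplified, hypothetical form (this is not the actual pyspark code; names and types below are illustrative):

```python
# Simplified, hypothetical sketch of "infer schema earlier and type check":
# the schema is inferred from the first row, then every subsequent row is
# verified against it before the DataFrame is created.
def infer_type(value):
    # Map a Python value to a simplified Spark type name.
    # bool is checked first because it is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    raise TypeError("unsupported type: %r" % type(value))

def verify_rows(rows):
    schema = [infer_type(v) for v in rows[0]]   # infer schema up front
    for row in rows[1:]:
        for expected, v in zip(schema, row):
            if infer_type(v) != expected:       # fail fast with a clear error
                raise TypeError(
                    "%s can not accept object %r in type %s"
                    % (expected, v, type(v)))
    return schema
```

For example, `verify_rows([(1, 2.0), (3, 4.0)])` returns `["long", "double"]`, while mixing a float and an int in the same column raises a `TypeError` instead of silently producing wrong values.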
### Why are the changes needed?
```python
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
from pyspark.testing.sqlutils import ExamplePoint
import pandas as pd
pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
df = spark.createDataFrame(pdf)
df.show()
```
The result is incorrect because of an incorrect type conversion.
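The root cause can be illustrated with a hypothetical miniature of a UDT like ExamplePoint (the class and method names below are illustrative, not pyspark's API): per the traceback further down, its internal SQL representation is an array of doubles, so constructing it with ints leaves ints in the serialized values, which a strict verifier must reject.

```python
# Hypothetical miniature of a UDT like ExamplePoint: its internal (SQL)
# representation is an array of doubles. No coercion happens during
# serialization, so ints survive into the internal values.
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def to_internal(self):
        # The UDT's sqlType expects [double, double].
        return [self.x, self.y]

def verify_double_array(values):
    # A strict element check: only Python floats are acceptable doubles.
    for v in values:
        if not isinstance(v, float):
            raise TypeError(
                "DoubleType can not accept object %r in type %s" % (v, type(v)))

verify_double_array(Point(1.0, 1.0).to_internal())   # passes
```

Here `Point(1, 1).to_internal()` yields `[1, 1]`, which the verifier rejects with a `TypeError` much like the one in the traceback.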
With this PR, a type check is performed:
```
(spark) ➜ spark git:(sadhen/SPARK-35211) ✗ bin/pyspark
Python 3.8.8 (default, Feb 24 2021, 13:46:16)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/04/24 17:42:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
      /_/

Using Python version 3.8.8 (default, Feb 24 2021 13:46:16)
Spark context Web UI available at http://172.30.0.12:4040
Spark context available as 'sc' (master = local[*], app id = local-1619257343692).
SparkSession available as 'spark'.
>>> spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
>>> from pyspark.testing.sqlutils import ExamplePoint
>>> import pandas as pd
>>> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
>>> df = spark.createDataFrame(pdf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/da/github/apache/spark/python/pyspark/sql/session.py", line 653, in createDataFrame
    return super(SparkSession, self).createDataFrame(
  File "/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py", line 340, in createDataFrame
    return self._create_dataframe(data, schema, samplingRatio, verifySchema)
  File "/Users/da/github/apache/spark/python/pyspark/sql/session.py", line 699, in _create_dataframe
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/Users/da/github/apache/spark/python/pyspark/sql/session.py", line 499, in _createFromLocal
    data = list(data)
  File "/Users/da/github/apache/spark/python/pyspark/sql/session.py", line 688, in prepare
    verify_func(obj)
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1409, in verify
    verify_value(obj)
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1390, in verify_struct
    verifier(v)
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1409, in verify
    verify_value(obj)
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1304, in verify_udf
    verifier(dataType.toInternal(obj))
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1409, in verify
    verify_value(obj)
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1354, in verify_array
    element_verifier(i)
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1409, in verify
    verify_value(obj)
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1403, in verify_default
    verify_acceptable_types(obj)
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1291, in verify_acceptable_types
    raise TypeError(new_msg("%s can not accept object %r in type %s"
TypeError: element in array field point: DoubleType can not accept object 1 in type <class 'int'>
```
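For context, the chain of verifiers visible in the traceback (struct → UDT → array → element) can be sketched in a simplified form. The factory functions below are hypothetical stand-ins, not pyspark's actual verifier builder:

```python
# Simplified, hypothetical sketch of how the verifiers in the traceback
# compose: each verifier wraps the next, so a struct field backed by a UDT
# whose sqlType is an array of doubles ends up checking every element.
def make_double_verifier():
    def verify(obj):
        if not isinstance(obj, float):
            raise TypeError(
                "DoubleType can not accept object %r in type %s"
                % (obj, type(obj)))
    return verify

def make_array_verifier(element_verifier):
    def verify(obj):
        for i in obj:                 # mirrors verify_array -> element_verifier(i)
            element_verifier(i)
    return verify

def make_udt_verifier(to_internal, inner_verifier):
    def verify(obj):
        # mirrors verify_udf -> verifier(dataType.toInternal(obj))
        inner_verifier(to_internal(obj))
    return verify

# Compose a verifier for a point-like value (hypothetical toInternal).
point_verifier = make_udt_verifier(
    lambda p: [p[0], p[1]],
    make_array_verifier(make_double_verifier()))

point_verifier((1.0, 2.0))   # passes; (1, 2) would raise TypeError
```

The composition explains why the traceback alternates between `verify` frames and type-specific frames: each layer unwraps one level of the schema before delegating.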
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.