[ https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392200#comment-15392200 ]
Sylvain Zimmer commented on SPARK-16700:
----------------------------------------

I dug into this a bit more: {{_verify_type({}, struct_schema)}} was already raising a similar exception in Spark 1.6.2; however, schema validation wasn't being enforced at all by {{createDataFrame}}: https://github.com/apache/spark/blob/branch-1.6/python/pyspark/sql/context.py#L418

In 2.0.0, it seems that it is done over each row of the data: https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L504

I think there are two issues that should be fixed here:
- {{_verify_type({}, struct_schema)}} shouldn't raise, because as far as I can tell dicts behave as expected and have their items correctly mapped as struct fields.
- There should be a way to go back to 1.6.x-like behaviour and disable schema verification in {{createDataFrame}}. The {{prepare()}} function is being map()'d over all the data coming from Python, which I think will definitely hurt performance for large datasets and complex schemas. Leaving it on by default but adding a flag to disable it would be a good solution. Without this, users will probably have to implement their own {{createDataFrame}} function like I did.

> StructType doesn't accept Python dicts anymore
> ----------------------------------------------
>
>                 Key: SPARK-16700
>                 URL: https://issues.apache.org/jira/browse/SPARK-16700
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Sylvain Zimmer
>
> Hello,
>
> I found this issue while testing my codebase with 2.0.0-rc5.
>
> StructType in Spark 1.6.2 accepts the Python <dict> type, which is very handy. 2.0.0-rc5 does not and throws an error.
>
> I don't know if this was intended, but I'd advocate for this behaviour to remain the same. MapType is probably wasteful when your key names never change, and switching to Python tuples would be cumbersome.
> Here is a minimal script to reproduce the issue:
>
> {code}
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
>
> sc = SparkContext()
> sqlc = SQLContext(sc)
>
> struct_schema = SparkTypes.StructType([
>     SparkTypes.StructField("id", SparkTypes.LongType())
> ])
>
> rdd = sc.parallelize([{"id": 0}, {"id": 1}])
> df = sqlc.createDataFrame(rdd, struct_schema)
> print df.collect()
> # 1.6.2 prints [Row(id=0), Row(id=1)]
> # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in type <type 'dict'>
> {code}
>
> Thanks!

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
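The comment's first point — that a dict's items can be mapped onto struct fields by name just as a tuple's values can by position — can be illustrated with a stripped-down, pure-Python sketch of such a verifier. This is hypothetical illustration code, not the actual {{_verify_type}} implementation in {{pyspark.sql.types}}:

```python
# Stripped-down sketch of a struct verifier that accepts both tuples
# (matched to fields by position) and dicts (matched by field name).
# Hypothetical illustration only -- not pyspark.sql.types._verify_type.

def verify_struct(obj, fields):
    """fields is a list of (name, python_type) pairs; returns the row values."""
    if isinstance(obj, dict):
        # A dict's items map onto struct fields by name.
        values = [obj.get(name) for name, _ in fields]
    elif isinstance(obj, tuple):
        if len(obj) != len(fields):
            raise ValueError("Length of object does not match schema")
        values = list(obj)
    else:
        raise TypeError("StructType can not accept object %r" % (obj,))
    # Per-field type check, skipping nulls.
    for value, (name, expected) in zip(values, fields):
        if value is not None and not isinstance(value, expected):
            raise TypeError("field %s: expected %s, got %r"
                            % (name, expected.__name__, value))
    return values

fields = [("id", int)]
print(verify_struct({"id": 0}, fields))  # [0] -- dict accepted
print(verify_struct((1,), fields))       # [1] -- tuple accepted
```

Under this sketch, the dict {'id': 0} from the bug report verifies cleanly against the single-field struct schema, which is the behaviour the comment argues {{_verify_type}} should keep.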