[ https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392200#comment-15392200 ]

Sylvain Zimmer commented on SPARK-16700:
----------------------------------------

I dug into this a bit more: 

{{_verify_type({}, struct_schema)}} was already raising a similar exception in 
Spark 1.6.2; however, schema validation wasn't being enforced at all by 
{{createDataFrame}}:
https://github.com/apache/spark/blob/branch-1.6/python/pyspark/sql/context.py#L418

In 2.0.0, verification is now applied to every row of the data:
https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L504
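
To see the verifier itself reject a dict, here is a minimal sketch that calls 
the internal helper directly (assuming {{_verify_type}} stays importable from 
{{pyspark.sql.types}}; it is a private function, so this is illustration only):

{code}
from pyspark.sql import types as SparkTypes
# _verify_type is the internal helper defined in pyspark/sql/types.py
from pyspark.sql.types import _verify_type

struct_schema = SparkTypes.StructType([
    SparkTypes.StructField("id", SparkTypes.LongType())
])

# Raises in both 1.6.2 and 2.0.0:
# TypeError: StructType can not accept object {'id': 0} in type <type 'dict'>
_verify_type({"id": 0}, struct_schema)
{code}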

I think there are two issues that should be fixed here:
 - {{_verify_type({}, struct_schema)}} shouldn't raise, because as far as I can 
tell dicts behave as expected and have their items correctly mapped to struct 
fields.
 - There should be a way to go back to the 1.6.x behaviour and disable schema 
verification in {{createDataFrame}}. The {{prepare()}} function is map()'d over 
all the data coming from Python, which is bound to hurt performance for large 
datasets and complex schemas. Leaving verification on by default but adding a 
flag to disable it would be a good solution. Without it, users will probably 
have to implement their own {{createDataFrame}} wrapper, as I did (see the 
sketch after this list).
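
For anyone hitting this in the meantime, here is a rough sketch of that kind of 
workaround. It is not the exact code from my codebase: the {{dicts_to_df}} 
helper name is mine, and it simply converts each dict to a tuple in schema 
order before calling {{createDataFrame}}, since tuples do pass the 
{{StructType}} verification:

{code}
from pyspark import SparkContext
from pyspark.sql import types as SparkTypes
from pyspark.sql import SQLContext

def dicts_to_df(sqlc, rdd, schema):
    """Illustrative helper: build a DataFrame from an RDD of dicts by
    converting each dict to a tuple ordered like the schema fields,
    which passes 2.0.0's per-row verification."""
    field_names = [f.name for f in schema.fields]
    tuples = rdd.map(lambda d: tuple(d.get(name) for name in field_names))
    return sqlc.createDataFrame(tuples, schema)

sc = SparkContext()
sqlc = SQLContext(sc)
struct_schema = SparkTypes.StructType([
    SparkTypes.StructField("id", SparkTypes.LongType())
])
rdd = sc.parallelize([{"id": 0}, {"id": 1}])
print dicts_to_df(sqlc, rdd, struct_schema).collect()
# [Row(id=0), Row(id=1)]
{code}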


> StructType doesn't accept Python dicts anymore
> ----------------------------------------------
>
>                 Key: SPARK-16700
>                 URL: https://issues.apache.org/jira/browse/SPARK-16700
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Sylvain Zimmer
>
> Hello,
> I found this issue while testing my codebase with 2.0.0-rc5.
> StructType in Spark 1.6.2 accepts the Python <dict> type, which is very 
> handy. 2.0.0-rc5 does not and throws an error.
> I don't know if this was intended, but I'd advocate for this behaviour to 
> remain the same. MapType is probably wasteful when your key names never 
> change, and switching to Python tuples would be cumbersome.
> Here is a minimal script to reproduce the issue: 
> {code}
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> struct_schema = SparkTypes.StructType([
>     SparkTypes.StructField("id", SparkTypes.LongType())
> ])
> rdd = sc.parallelize([{"id": 0}, {"id": 1}])
> df = sqlc.createDataFrame(rdd, struct_schema)
> print df.collect()
> # 1.6.2 prints [Row(id=0), Row(id=1)]
> # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in type <type 'dict'>
> {code}
> Thanks!


