Repository: spark
Updated Branches:
  refs/heads/master 71d1c907d -> f92f334ca

[SPARK-11437] [PYSPARK] Don't .take when converting RDD to DataFrame with provided schema

When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls `.take(10)` to verify the first 10 rows of the RDD match the provided schema. Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue affected cases where a schema was not provided.

Verifying the first 10 rows is of limited utility and causes the DAG to be executed non-lazily. If necessary, I believe this verification should be done lazily on all rows. However, since the caller is providing a schema to follow, I think it's acceptable to simply fail if the schema is incorrect.

marmbrus We chatted about this at SparkSummitEU. davies you made a similar change for the infer-schema path in https://github.com/apache/spark/pull/6606

Author: Jason White <[email protected]>

Closes #9392 from JasonMWhite/createDataFrame_without_take.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f92f334c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f92f334c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f92f334c

Branch: refs/heads/master
Commit: f92f334ca47c03b980b06cf300aa652d0ffa1880
Parents: 71d1c90
Author: Jason White <[email protected]>
Authored: Mon Nov 2 10:49:06 2015 -0800
Committer: Davies Liu <[email protected]>
Committed: Mon Nov 2 10:49:06 2015 -0800

----------------------------------------------------------------------
 python/pyspark/sql/context.py | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/f92f334c/python/pyspark/sql/context.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/context.py b/python/pyspark/sql/context.py
index 7945365..924bb64 100644
--- a/python/pyspark/sql/context.py
+++ b/python/pyspark/sql/context.py
@@ -318,13 +318,7 @@ class SQLContext(object):
                     struct.names[i] = name
                 schema = struct
 
-        elif isinstance(schema, StructType):
-            # take the first few rows to verify schema
-            rows = rdd.take(10)
-            for row in rows:
-                _verify_type(row, schema)
-
-        else:
+        elif not isinstance(schema, StructType):
             raise TypeError("schema should be StructType or list or None, but got: %s" % schema)
 
         # convert python objects to sql data
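
The commit message contrasts eagerly checking the first 10 rows with verifying all rows lazily. The following is a minimal pure-Python sketch of that distinction, not Spark code: plain iterators stand in for an RDD, a tuple of Python types stands in for a `StructType`, and `verify_row`, `eager_check`, `lazy_check`, and `make_source` are hypothetical names introduced only for illustration.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List, Sequence, Tuple

Row = Tuple[Any, ...]


def verify_row(row: Row, schema: Sequence[type]) -> None:
    """Raise TypeError if `row` does not match `schema` (hypothetical helper)."""
    if len(row) != len(schema):
        raise TypeError("expected %d fields, got %d: %r" % (len(schema), len(row), row))
    for value, expected in zip(row, schema):
        # Assumption for this sketch: None is acceptable in any field.
        if value is not None and not isinstance(value, expected):
            raise TypeError("%r is not of type %s" % (value, expected.__name__))


def eager_check(rows: Iterable[Row], schema: Sequence[type], n: int = 10) -> List[Row]:
    """Pull the first `n` rows immediately and verify only those rows."""
    head = list(islice(rows, n))
    for row in head:
        verify_row(row, schema)
    return head


def lazy_check(rows: Iterable[Row], schema: Sequence[type]) -> Iterator[Row]:
    """Verify every row, but only when a consumer actually requests it."""
    for row in rows:
        verify_row(row, schema)
        yield row


def make_source(log: List[Row]) -> Iterator[Row]:
    """Yield sample rows, recording each one in `log` as it is produced."""
    data = [(1, "a"), (2, "b"), ("oops", "c")]  # the third row violates (int, str)
    for row in data:
        log.append(row)
        yield row


if __name__ == "__main__":
    schema = (int, str)

    produced: List[Row] = []
    eager_check(make_source(produced), schema, n=2)
    print("eager check forced", len(produced), "rows and missed the bad third row")

    produced = []
    checked = lazy_check(make_source(produced), schema)
    print("lazy check forced", len(produced), "rows before iteration")
    try:
        list(checked)
    except TypeError as exc:
        print("lazy check failed during iteration:", exc)
```

In this sketch the eager variant does work at call time and still misses the bad row beyond its sample, while the lazy variant does nothing until iterated and then checks every row. In PySpark the lazy variant would presumably be expressed as a per-row function mapped over the RDD; that mapping is an assumption here, not something the commit implements.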
