spark git commit: [SPARK-11437] [PYSPARK] Don't .take when converting RDD to DataFrame with provided schema

davies Mon, 02 Nov 2015 10:49:57 -0800

Repository: spark
Updated Branches:
  refs/heads/master 71d1c907d -> f92f334ca



[SPARK-11437] [PYSPARK] Don't .take when converting RDD to DataFrame with 
provided schema

When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls 
`.take(10)` to verify that the first 10 rows of the RDD match the provided 
schema. This is similar to https://issues.apache.org/jira/browse/SPARK-8070, 
but that issue affected cases where a schema was not provided.

Verifying only the first 10 rows is of limited utility, and the `.take(10)` 
call forces the DAG to be executed eagerly. If necessary, I believe this 
verification should be done lazily on all rows. However, since the caller is 
providing a schema to follow, I think it's acceptable to simply fail if the 
schema is incorrect.
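
To illustrate the difference between an eager check on the first 10 rows and 
a lazy check on all rows, here is a minimal pure-Python sketch. It is an 
analogy only, not Spark code: a generator stands in for the lazy RDD, and 
`make_source`, `verify_row`, `create_eager` and `create_lazy` are hypothetical 
names invented for this example. It assumes Python 3, where `map` is lazy.

```python
from itertools import islice


def make_source(log):
    """Return a lazy row source; every row it computes is recorded in `log`."""
    def rows():
        for i in range(1000):
            log.append(i)  # side effect marks that work was actually executed
            yield (i, str(i))
    return rows


def verify_row(row):
    """Hypothetical stand-in for schema verification: accepts (int, str) rows."""
    if not (isinstance(row[0], int) and isinstance(row[1], str)):
        raise TypeError("row does not match schema: %r" % (row,))
    return row


def create_eager(source):
    """Analogy for the old behaviour: check the first 10 rows at creation time."""
    for row in islice(source(), 10):
        verify_row(row)
    return source


def create_lazy(source):
    """Analogy for lazy verification: check each row only when it is consumed."""
    return lambda: map(verify_row, source())


eager_log = []
create_eager(make_source(eager_log))
print(len(eager_log))  # rows computed just by creating the eager wrapper

lazy_log = []
lazy = create_lazy(make_source(lazy_log))
print(len(lazy_log))   # rows computed before anything is consumed
```

In this sketch the eager variant computes rows as soon as the wrapper is 
created, while the lazy variant does no work until its result is iterated, at 
which point a mismatched row would raise `TypeError`.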

marmbrus: we chatted about this at SparkSummitEU. davies: you made a similar 
change for the infer-schema path in https://github.com/apache/spark/pull/6606

Author: Jason White <[email protected]>

Closes #9392 from JasonMWhite/createDataFrame_without_take.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f92f334c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f92f334c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f92f334c

Branch: refs/heads/master
Commit: f92f334ca47c03b980b06cf300aa652d0ffa1880
Parents: 71d1c90
Author: Jason White <[email protected]>
Authored: Mon Nov 2 10:49:06 2015 -0800
Committer: Davies Liu <[email protected]>
Committed: Mon Nov 2 10:49:06 2015 -0800

----------------------------------------------------------------------
 python/pyspark/sql/context.py | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/f92f334c/python/pyspark/sql/context.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/context.py b/python/pyspark/sql/context.py
index 7945365..924bb64 100644
--- a/python/pyspark/sql/context.py
+++ b/python/pyspark/sql/context.py
@@ -318,13 +318,7 @@ class SQLContext(object):
                     struct.names[i] = name
             schema = struct
 
-        elif isinstance(schema, StructType):
-            # take the first few rows to verify schema
-            rows = rdd.take(10)
-            for row in rows:
-                _verify_type(row, schema)
-
-        else:
+        elif not isinstance(schema, StructType):
             raise TypeError("schema should be StructType or list or None, but got: %s" % schema)
 
         # convert python objects to sql data


