Github user gberger commented on the issue:

    https://github.com/apache/spark/pull/19792

@HyukjinKwon done, with test added.

```
>>> spark.createDataFrame(spark.sparkContext.parallelize([[None, 1], ["a", None], [1, 1]]), schema=["a", "b"], samplingRatio=0.99)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/gberger/Projects/spark/python/pyspark/sql/session.py", line 644, in createDataFrame
    rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
  File "/Users/gberger/Projects/spark/python/pyspark/sql/session.py", line 383, in _createFromRDD
    struct = self._inferSchema(rdd, samplingRatio, names=schema)
  File "/Users/gberger/Projects/spark/python/pyspark/sql/session.py", line 375, in _inferSchema
    schema = rdd.map(lambda row: _infer_schema(row, names)).reduce(_merge_type)
  File "/Users/gberger/Projects/spark/python/pyspark/rdd.py", line 852, in reduce
    return reduce(f, vals)
  File "/Users/gberger/Projects/spark/python/pyspark/sql/types.py", line 1133, in _merge_type
    for f in a.fields]
  File "/Users/gberger/Projects/spark/python/pyspark/sql/types.py", line 1126, in _merge_type
    raise TypeError(new_msg("Can not merge type %s and %s" % (type(a), type(b))))
TypeError: field a: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.LongType'>
```

Also, with this last change, I was able to simplify the code in `_createFromRDD`: since I now pass the field names down to `_inferSchema` (and from there to `_infer_schema`), the inferred schema already comes with field names, so there is no need to set them again in `_createFromRDD`. Tests for this still pass. Let me know if you can think of any edge case not covered by the tests that would break.
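To illustrate the merge behavior (this is a minimal standalone sketch, not the actual PySpark implementation — `infer_row_schema`, `merge_field`, and `merge_schemas` are hypothetical helpers): each row yields a per-field type mapping that carries the field names, the mappings are reduced pairwise, and an incompatible pair of types for the same field raises a `TypeError` naming that field, mirroring the traceback above.

```python
# Sketch of names-aware schema inference and merging.
# Hypothetical helpers; the real logic lives in pyspark.sql.types._infer_schema
# and _merge_type, which work on StructType/DataType objects instead of dicts.
from functools import reduce

def infer_row_schema(row, names):
    # Map each field name to the Python type of its value;
    # None stands in for "type unknown so far" (a null value).
    return {name: (type(v) if v is not None else None)
            for name, v in zip(names, row)}

def merge_field(name, a, b):
    # Nulls defer to the other side; identical types merge trivially.
    if a is None:
        return b
    if b is None or a is b:
        return a
    raise TypeError("field %s: Can not merge type %s and %s" % (name, a, b))

def merge_schemas(s1, s2):
    return {name: merge_field(name, s1[name], s2[name]) for name in s1}

rows = [[None, 1], ["a", None], [1, 1]]
names = ["a", "b"]
try:
    schema = reduce(merge_schemas,
                    (infer_row_schema(r, names) for r in rows))
except TypeError as e:
    print(e)  # field a: Can not merge type <class 'str'> and <class 'int'>
```

Because the names are threaded through inference, the error message can point at the offending field (`field a`) instead of only reporting the two conflicting types.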