Guilherme Berger created SPARK-22566:
----------------------------------------
Summary: Better error message for `_merge_type` in Pandas to Spark DF conversion
Key: SPARK-22566
URL: https://issues.apache.org/jira/browse/SPARK-22566
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 2.2.0
Reporter: Guilherme Berger
Priority: Minor
When creating a Spark DataFrame from a Pandas DataFrame without specifying a
schema, schema inference is used. This inference can fail when a column
contains values of two different types; that failure is expected. The problem
is that the error message does not say in which column it happened, which
makes these cases painful to debug.

I plan on submitting a PR which fixes this, providing a better error message
that includes the column name (and possibly the problematic values too).
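A minimal way to trigger this (a sketch; assumes pandas is installed and `spark_session` is an active SparkSession, as in the traceback below):

```python
import pandas as pd

# A column whose values mix two types (here int and str) makes PySpark's
# schema inference fail with the vague TypeError shown below.
pandas_df = pd.DataFrame({"x": [1, "two"]})

# spark_session.createDataFrame(pandas_df)  # raises the TypeError below
```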
>>> spark_session.createDataFrame(pandas_df)
  File "redacted/pyspark/sql/session.py", line 541, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "redacted/pyspark/sql/session.py", line 401, in _createFromLocal
    struct = self._inferSchemaFromList(data)
  File "redacted/pyspark/sql/session.py", line 333, in _inferSchemaFromList
    schema = reduce(_merge_type, map(_infer_schema, data))
  File "redacted/pyspark/sql/types.py", line 1124, in _merge_type
    for f in a.fields]
  File "redacted/pyspark/sql/types.py", line 1118, in _merge_type
    raise TypeError("Can not merge type %s and %s" % (type(a), type(b)))
TypeError: Can not merge type <class 'pyspark.sql.types.LongType'> and <class 'pyspark.sql.types.StringType'>