Franklyn Dsouza created SPARK-13410:
---------------------------------------
Summary: unionAll throws error with DataFrames containing UDT columns.
Key: SPARK-13410
URL: https://issues.apache.org/jira/browse/SPARK-13410
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.6.0, 1.5.0
Reporter: Franklyn Dsouza
Unioning two DataFrames that contain UDTs fails with
{quote}
AnalysisException: u"unresolved operator 'Union;"
{quote}
I tracked this down to this line:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala#L202
It compares the data types of the output attributes of both logical plans.
For UDTs, however, each data type is a new instance of UserDefinedType or
PythonUserDefinedType:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L158
So the equality check tests whether the two instances are the same object, and
since they aren't references to a singleton, the check fails. Note: this works
fine if you union a DataFrame with itself.
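To illustrate the failure mode described above, here is a small plain-Python analogy. It is not Spark code: the actual comparison runs in Scala on the JVM, and the class name PointUDT is a hypothetical stand-in. It only shows that a class without its own equality method falls back to identity comparison, so two separately created instances are unequal while an instance compared with itself is equal.

```python
class PointUDT(object):
    """Hypothetical stand-in for a user-defined type (not Spark's class).

    No __eq__ is defined, so Python falls back to identity comparison.
    """
    pass


left = PointUDT()
right = PointUDT()

# Same object on both sides: equal (like unioning a DataFrame with itself).
same_instance_equal = (left == left)

# Two separate instances of the same type: not equal under identity comparison.
separate_instances_equal = (left == right)

print(same_instance_equal, separate_instances_equal)
```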
I have a patch for this which overrides the equality operator on the two
classes here: https://github.com/damnMeddlingKid/spark/pull/2
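As a rough sketch of what overriding equality can look like, the plain-Python example below treats two instances as equal when they are of the same class. This is an assumption about one possible approach, not the contents of the linked patch, and the class names PointUDT and OtherUDT are hypothetical.

```python
class PointUDT(object):
    """Hypothetical stand-in for a user-defined type (not Spark's class)."""

    def __eq__(self, other):
        # Assumption: two type descriptors are equal if they are the same class.
        return type(self) is type(other)

    def __ne__(self, other):
        # Needed explicitly on Python 2; harmless on Python 3.
        return not self.__eq__(other)

    def __hash__(self):
        # Keep hashing consistent with the overridden equality.
        return hash(type(self))


class OtherUDT(PointUDT):
    """A different hypothetical type, used to show that types still differ."""
    pass


same_type_equal = (PointUDT() == PointUDT())
different_type_equal = (PointUDT() == OtherUDT())

print(same_type_equal, different_type_equal)
```

With equality defined on the type rather than on object identity, two separately constructed descriptors of the same type compare equal, which is the property the schema comparison needs.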
Reproduction steps
{code}
from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
from pyspark.sql import types
schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)])
# Note: a and b must be two separate DataFrames.
# sqlCtx is assumed to be the SQLContext provided by the pyspark shell.
a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema)
b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema)
c = a.unionAll(b)
{code}