Nicholas Chammas created SPARK-18254:
----------------------------------------
Summary: UDFs don't see aliased column names; somehow they get the
original names
Key: SPARK-18254
URL: https://issues.apache.org/jira/browse/SPARK-18254
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.1
Environment: Python 3.5, Java 8
Reporter: Nicholas Chammas
Dunno if I'm misinterpreting something here, but this seems like a bug in how
UDFs work, or in how they interface with the optimizer.
Here's a basic reproduction:
{code}
import pyspark
from pyspark.sql import Row
from pyspark.sql.functions import udf, col, struct


def length(full_name):
    # The non-aliased names, FIRST and LAST, show up here, instead of
    # first_name and last_name.
    # print(full_name)
    return len(full_name.first_name) + len(full_name.last_name)


if __name__ == '__main__':
    spark = (
        pyspark.sql.SparkSession.builder
        .getOrCreate())

    length_udf = udf(length)

    names = spark.createDataFrame([
        Row(FIRST='Nick', LAST='Chammas'),
        Row(FIRST='Walter', LAST='Williams'),
    ])

    names_cleaned = (
        names
        .select(
            col('FIRST').alias('first_name'),
            col('LAST').alias('last_name'),
        )
        .withColumn('full_name', struct('first_name', 'last_name'))
        .select('full_name'))

    # We see the schema we expect here.
    names_cleaned.printSchema()

    # However, here we get an AttributeError. length_udf() cannot
    # find first_name or last_name.
    (names_cleaned
        .withColumn('length', length_udf('full_name'))
        .show())
{code}
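For what it's worth, the failure mode can be pictured without Spark at all. Below is a minimal sketch, assuming the struct value handed to the UDF behaves like a tuple whose fields carry the original names FIRST and LAST. The namedtuple stand-in and both helper names are made up for illustration, and positional access as a workaround is only a guess that has not been verified against 2.0.1:

```python
from collections import namedtuple

# Stand-in for the struct value the UDF receives (assumption: it behaves like
# a tuple whose fields carry the original names FIRST and LAST).
FullName = namedtuple('FullName', ['FIRST', 'LAST'])


def length_by_name(full_name):
    # Mirrors length() from the reproduction: looks fields up by alias.
    return len(full_name.first_name) + len(full_name.last_name)


def length_by_position(full_name):
    # Hypothetical workaround: index by position, which does not depend on
    # which field names the struct ends up with.
    return len(full_name[0]) + len(full_name[1])


value = FullName(FIRST='Nick', LAST='Chammas')

try:
    length_by_name(value)
except AttributeError as exc:
    # Same class of error as in the reproduction: the alias is not a field.
    print('AttributeError:', exc)

print(length_by_position(value))  # 11
```

If positional access does work inside the real UDF, that would suggest the struct's values are fine and only its field names are wrong.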
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)