Nicholas Chammas created SPARK-18866:
----------------------------------------
Summary: Codegen fails with cryptic error if regexp_replace() output column is not aliased
Key: SPARK-18866
URL: https://issues.apache.org/jira/browse/SPARK-18866
Project: Spark
Issue Type: Bug
Components: PySpark, SQL
Affects Versions: 2.0.2, 2.1.0
Environment: Java 8, Python 3.5
Reporter: Nicholas Chammas
Priority: Minor
Here's a minimal repro:
{code}
import pyspark
from pyspark.sql import Column, DataFrame
from pyspark.sql.functions import regexp_replace, trim, lower, col


def normalize_udf(column: Column) -> Column:
    normalized_column = (
        regexp_replace(
            column,
            pattern='[\s]+',
            replacement=' ',
        )
    )
    return normalized_column


if __name__ == '__main__':
    spark = pyspark.sql.SparkSession.builder.getOrCreate()
    raw_df = spark.createDataFrame(
        [(' ',)],
        ['string'],
    )
    normalized_df = raw_df.select(normalize_udf('string'))
    normalized_df_prime = (
        normalized_df
        .groupBy(sorted(normalized_df.columns))
        .count())
    normalized_df_prime.show()
{code}
When I run this I get:
{code}
ERROR CodeGenerator: failed to compile:
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 80,
Column 130: Invalid escape sequence
{code}
This is followed by a huge dump of generated Java code.
Can you spot the error in my code?
It's simple: I just need to alias the output of {{normalize_udf()}} and all is
forgiven:
{code}
normalized_df = raw_df.select(normalize_udf('string').alias('string'))
{code}
Of course, it's impossible to tell that from the current error output. So my
*first question* is: Is there some way we can better communicate to the user
what went wrong?
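One possible direction, sketched purely as an illustration: check column names before grouping and flag any that look like auto-generated expression strings rather than plain identifiers. The helper name {{find_unaliased_looking_columns()}} is hypothetical and not part of PySpark, and the sample name below is only an assumption about roughly what Spark generates for an unaliased {{regexp_replace()}} column.

```python
import re
from typing import Iterable, List

# Hypothetical helper, not part of PySpark. A "plain" name here means a
# simple identifier: letters, digits and underscores, not starting with a digit.
_PLAIN_NAME = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')


def find_unaliased_looking_columns(columns: Iterable[str]) -> List[str]:
    """Return the column names that are not plain identifiers."""
    return [name for name in columns if not _PLAIN_NAME.fullmatch(name)]


if __name__ == '__main__':
    # Assumption: an unaliased regexp_replace() column gets a name that
    # embeds the pattern, backslash included. The exact text may differ.
    columns = ['string', r'regexp_replace(string, [\s]+,  )']
    for name in find_unaliased_looking_columns(columns):
        print('Column {!r} looks unaliased; consider .alias()'.format(name))
```

In the repro this would be called on {{normalized_df.columns}} before the {{groupBy()}}. It only guesses at the problem from the shape of the name, so it is a user-side stopgap and not a substitute for a better error from the code generator.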
Another interesting thing I noticed is that if I try this:
{code}
normalized_df = raw_df.select(lower('string'))
{code}
I immediately get a clean error saying:
{code}
py4j.protocol.Py4JError: An error occurred while calling
z:org.apache.spark.sql.functions.lower. Trace:
py4j.Py4JException: Method lower([class java.lang.String]) does not exist
{code}
I can fix this by building a column object:
{code}
normalized_df = raw_df.select(lower(col('string')))
{code}
So that raises *a second problem/question*: Why does {{lower()}} require that I
build a Column object, whereas {{regexp_replace()}} does not? The inconsistency
adds to the confusion here.