[ 
https://issues.apache.org/jira/browse/SPARK-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-18866:
-------------------------------------
    Description: 
Here's a minimal repro:

{code}
import pyspark
from pyspark.sql import Column
from pyspark.sql.functions import regexp_replace, lower, col


def normalize_udf(column: Column) -> Column:
    normalized_column = (
        regexp_replace(
            column,
            pattern=r'[\s]+',
            replacement=' ',
        )
    )
    return normalized_column


if __name__ == '__main__':
    spark = pyspark.sql.SparkSession.builder.getOrCreate()
    raw_df = spark.createDataFrame(
        [('          ',)],
        ['string'],
    )
    normalized_df = raw_df.select(normalize_udf('string'))
    normalized_df_prime = (
        normalized_df
        .groupBy(sorted(normalized_df.columns))
        .count())
    normalized_df_prime.show()
{code}

When I run this I get:

{code}
ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 80, 
Column 130: Invalid escape sequence
{code}

That's followed by a huge barf of generated Java code, _and then the output I 
expect_. (So despite the scary error, the code actually works!)

Can you spot the error in my code?

It's simple: I just need to alias the output of {{normalize_udf()}} and all is 
forgiven:

{code}
normalized_df = raw_df.select(normalize_udf('string').alias('string'))
{code}

Of course, it's impossible to tell that from the current error output. So my 
*first question* is: Is there some way we can better communicate to the user 
what went wrong?
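
For what it's worth, my guess (I haven't traced the codegen path, so treat this as an 
assumption) is that the unaliased column keeps its auto-generated name, which embeds 
the regex pattern and therefore a backslash, and that name leaks unescaped into the 
generated Java source. Reusing {{raw_df}} and {{normalize_udf()}} from the repro above, 
you can see the offending name without triggering codegen:

{code}
# Print the auto-generated column name of the unaliased expression.
# (The exact name may vary by Spark version, but it contains the '[\s]+' pattern.)
print(raw_df.select(normalize_udf('string')).columns)
{code}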

Another interesting thing I noticed is that if I try this:

{code}
normalized_df = raw_df.select(lower('string'))
{code}

I immediately get a clean error saying:

{code}
py4j.protocol.Py4JError: An error occurred while calling 
z:org.apache.spark.sql.functions.lower. Trace:
py4j.Py4JException: Method lower([class java.lang.String]) does not exist
{code}

I can fix this by building a Column object:

{code}
normalized_df = raw_df.select(lower(col('string')))
{code}

So that raises *a second problem/question*: Why does {{lower()}} require that I 
build a Column object, whereas {{regexp_replace()}} does not? The inconsistency 
adds to the confusion here.
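
If it helps, here's my rough understanding of why the two behave differently, 
paraphrased from {{pyspark/sql/functions.py}} (a sketch from memory, not a verbatim 
copy, so the details may differ between versions): {{regexp_replace()}} runs its first 
argument through {{_to_java_column()}}, which accepts either a Column or a column-name 
string, while {{lower()}} comes from a small factory that only unwraps Column objects 
and passes anything else straight to the JVM, where no {{lower(String)}} overload 
exists.

{code}
# Sketch of the two wrapper styles (paraphrased from pyspark/sql/functions.py, not verbatim).
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column


def regexp_replace(str, pattern, replacement):
    sc = SparkContext._active_spark_context
    # _to_java_column() accepts a Column *or* a string column name,
    # so regexp_replace('string', ...) works.
    jc = sc._jvm.functions.regexp_replace(_to_java_column(str), pattern, replacement)
    return Column(jc)


def _create_function(name, doc=""):
    def _(col):
        sc = SparkContext._active_spark_context
        # Only unwraps Column objects; a plain Python string goes to the JVM as-is,
        # so Py4J looks for functions.lower(String), which does not exist.
        jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
        return Column(jc)
    _.__name__ = name
    _.__doc__ = doc
    return _


lower = _create_function('lower', 'Converts a string expression to lower case.')
{code}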



> Codegen fails with cryptic error if regexp_replace() output column is not aliased
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-18866
>                 URL: https://issues.apache.org/jira/browse/SPARK-18866
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.0.2, 2.1.0
>         Environment: Java 8, Python 3.5
>            Reporter: Nicholas Chammas
>            Priority: Minor
>


