Michel Davit created SPARK-23986:

             Summary: CompileException when using too many avg aggregation after joining
                 Key: SPARK-23986
                 URL: https://issues.apache.org/jira/browse/SPARK-23986
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Michel Davit

Consider the following code:
    val df1: DataFrame = sparkSession.sparkContext
      .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6)))
      .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6")

    val df2: DataFrame = sparkSession.sparkContext
      .makeRDD(Seq((0, "val1", "val2")))
      .toDF("key", "dummy1", "dummy2")

    val agg = df1
      .join(df2, df1("key") === df2("key"), "leftouter")

    val head = agg.take(1)
This logs the following exception:
ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 467, Column 28: Redefinition of parameter "agg_expr_11"
I am not a Spark expert, but after investigation I realized that the generated 
{{doConsume}} method is responsible for the exception.

Indeed, {{avg}} calls several times 
The 1st time with the 'avg' Expr and a second time for the base aggregation 
Expr (count and sum).

The problem comes from the generation of parameters in CodeGenerator:
   * Returns a term name that is unique within this instance of a 
  def freshName(name: String): String = synchronized {
    val fullName = if (freshNamePrefix == "") {
      name
    } else {
      s"${freshNamePrefix}_$name"
    }
    if (freshNameIds.contains(fullName)) {
      val id = freshNameIds(fullName)
      freshNameIds(fullName) = id + 1
      s"$fullName$id"
    } else {
      freshNameIds += fullName -> 1
      fullName
    }
  }
The {{freshNameIds}} map already contains {{agg_expr_[1..6]}} from the 1st call.
The second call is made with {{agg_expr_[1..12]}} and generates the following names:
{{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have 2 parameter name conflicts 
in the generated code: {{agg_expr_11}} and {{agg_expr_12}}.
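
The collision can be reproduced outside Spark with a small stand-alone model of the naming scheme. This is only a sketch under assumptions: {{NameGen}} and {{CollisionDemo}} are hypothetical names, not Spark classes, and the prefix "agg" with names "expr_1" to "expr_12" is chosen to match the identifiers quoted above.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the naming logic described above; not Spark's class.
class NameGen(prefix: String) {
  private val ids = mutable.HashMap.empty[String, Int]

  def freshName(name: String): String = synchronized {
    val fullName = if (prefix == "") name else s"${prefix}_$name"
    if (ids.contains(fullName)) {
      val id = ids(fullName)
      ids(fullName) = id + 1
      s"$fullName$id" // id is appended with no separator
    } else {
      ids += fullName -> 1
      fullName
    }
  }
}

object CollisionDemo {
  // The first pass registers expr_1..expr_6, the second pass asks for expr_1..expr_12.
  def secondPassNames(): Seq[String] = {
    val gen = new NameGen("agg")
    (1 to 6).foreach(i => gen.freshName(s"expr_$i"))
    (1 to 12).map(i => gen.freshName(s"expr_$i"))
  }

  // Names that were returned more than once.
  def duplicates(names: Seq[String]): Set[String] =
    names.groupBy(identity).collect { case (n, xs) if xs.size > 1 => n }.toSet

  def main(args: Array[String]): Unit = {
    val names = secondPassNames()
    println(names.mkString(", "))
    println(s"duplicates: ${duplicates(names)}")
  }
}
```

In this simplified model the second pass returns {{agg_expr_11}} twice: once for the repeated "expr_1" (id 1 appended) and once for the new "expr_11". That matches the parameter name reported in the compiler error.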

Appending the 'id' directly, as in s"$fullName$id", to generate a unique term 
name is the source of the conflict. Maybe simply using an underscore can solve 
this issue: s"${fullName}_$id"
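
The effect of a separator can be checked with a stand-alone sketch. {{SeparatedNameGen}} and {{SeparatorDemo}} are hypothetical names used for illustration only, and the prefix "agg" with names "expr_1" to "expr_12" is an assumption matching the identifiers above; this is not a patch to Spark.

```scala
import scala.collection.mutable

// Hypothetical variant of the naming logic with an underscore before the id.
class SeparatedNameGen(prefix: String) {
  private val ids = mutable.HashMap.empty[String, Int]

  def freshName(name: String): String = synchronized {
    val fullName = if (prefix == "") name else s"${prefix}_$name"
    if (ids.contains(fullName)) {
      val id = ids(fullName)
      ids(fullName) = id + 1
      s"${fullName}_$id" // braces are needed so Scala does not read `fullName_` as the identifier
    } else {
      ids += fullName -> 1
      fullName
    }
  }
}

object SeparatorDemo {
  // Same request pattern as in the report: expr_1..expr_6, then expr_1..expr_12.
  def allNames(): Seq[String] = {
    val gen = new SeparatedNameGen("agg")
    val first = (1 to 6).map(i => gen.freshName(s"expr_$i"))
    val second = (1 to 12).map(i => gen.freshName(s"expr_$i"))
    first ++ second
  }

  def main(args: Array[String]): Unit =
    println(allNames().mkString(", "))
}
```

With this request pattern all 18 names are distinct, because the repeated "expr_1" becomes {{agg_expr_1_1}} rather than {{agg_expr_11}}. The separator only narrows the problem: a caller that asks directly for a name such as "expr_1_1" would still collide, since generated names are not recorded in the map.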
