[ 
https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yahui Liu updated SPARK-35500:
------------------------------
    Affects Version/s:     (was: 3.1.0)
                       2.4.5

> GenerateSafeProjection.generate will generate SpecificSafeProjection class, 
> but if column is array type or map type, the code cannot be reused which 
> impact the query performance
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-35500
>                 URL: https://issues.apache.org/jira/browse/SPARK-35500
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.5
>            Reporter: Yahui Liu
>            Priority: Minor
>              Labels: codegen
>
> Reproduce steps:
>  # create a new table with array type: create table test_code_gen(a 
> array<int>);
>  # Add 
> log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator 
> = DEBUG to log4j.properties;
>  # Enter spark-shell, fire a query: spark.sql("select * from 
> test_code_gen").collect
>  # Everytime, Dataset.collect is called, SpecificSafeProjection class is 
> generated, but the code for the class cannot be reused because everytime the 
> id for two variables in the generated class is changed: MapObjects_loopValue 
> and MapObjects_loopIsNull. So even the class generated before has been 
> cached, new code cannot match the cache key so that new code need to be 
> compiled again which cost some time.  The time cost for compile is increasing 
> with the growth of column number, for wide table, this cost can more than 2s. 
> {code:java}
> object MapObjects {
>   private val curId = new java.util.concurrent.atomic.AtomicInteger()
>  val id = curId.getAndIncrement()
>  val loopValue = s"MapObjects_loopValue$id"
>  val loopIsNull = if (elementNullable) {
>    s"MapObjects_loopIsNull$id"
>  } else {
>    "false"
>  }
> {code}
> First time run: 
> {code:java}
> class SpecificSafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
>  private int MapObjects_loopValue1;
>  private boolean MapObjects_loopIsNull1;
>  private UTF8String MapObjects_loopValue2;
>  private boolean MapObjects_loopIsNull2;
> }
> {code}
> Second time run:
> {code:java}
> class SpecificSafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
>  private int MapObjects_loopValue3;
>  private boolean MapObjects_loopIsNull3;
>  private UTF8String MapObjects_loopValue4;
>  private boolean MapObjects_loopIsNull4;
> }{code}
> Expectation:
> The code generated by GenerateSafeProjection can be reused if the query is 
> same.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to