[ 
https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yahui Liu updated SPARK-35500:
------------------------------
    Description: 
Steps to reproduce:
 # Create a new table with an array-typed column: create table test_code_gen(a array<int>);
 # Add 
log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = 
DEBUG to log4j.properties;
 # Start spark-shell and run a query: spark.sql("select * from 
test_code_gen").collect
 # Every time Dataset.collect is called, a SpecificSafeProjection class is 
generated, but the generated code cannot be reused because the ids embedded in 
two variable names change on every call: MapObjects_loopValue and 
MapObjects_loopIsNull. So even though the previously generated class has been 
cached, the new code does not match the cache key and has to be compiled again, 
which costs time. The ids appear to come from a global counter:

object MapObjects {
 private val curId = new java.util.concurrent.atomic.AtomicInteger()
 # Compilation time grows with the number of columns; for a wide table it can 
exceed 2s. 
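
The cache miss described above can be illustrated with a small, self-contained sketch. This is a simplified model and not Spark's actual CodeGenerator: it assumes only that the cache is keyed by the generated source text, as the description implies. The names CodegenCacheSketch, generateWithGlobalId, generateWithLocalId and compilesNeeded are hypothetical, and the per-generation counter is shown purely for contrast, not as the fix adopted for this issue.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.mutable

// Simplified model of a source-text-keyed compile cache (hypothetical, not Spark code).
object CodegenCacheSketch {
  // Stand-in for the global counter shown in the issue's MapObjects fragment.
  private val curId = new AtomicInteger()

  // Variable names embed an id from the global counter, so the text differs on every call.
  def generateWithGlobalId(): String = {
    val id = curId.getAndIncrement()
    s"int MapObjects_loopValue$id = 0; boolean MapObjects_loopIsNull$id = false;"
  }

  // Contrast case: ids restart from zero for each generation run, so the text is stable.
  def generateWithLocalId(): String = {
    val localId = new AtomicInteger()
    val id = localId.getAndIncrement()
    s"int MapObjects_loopValue$id = 0; boolean MapObjects_loopIsNull$id = false;"
  }

  // Generates code `runs` times against a fresh cache and counts cache misses (compilations).
  def compilesNeeded(runs: Int, generate: () => String): Int = {
    val cache = mutable.Map.empty[String, String]
    var compiles = 0
    for (_ <- 1 to runs) {
      val source = generate()
      // The by-name argument only runs on a cache miss, standing in for a real compile.
      cache.getOrElseUpdate(source, { compiles += 1; s"compiled:${source.hashCode}" })
    }
    compiles
  }

  def main(args: Array[String]): Unit = {
    println(compilesNeeded(3, () => generateWithGlobalId())) // 3: every run misses the cache
    println(compilesNeeded(3, () => generateWithLocalId()))  // 1: later runs hit the cache
  }
}
```

With the global counter, each of the three runs produces different text and triggers a compile. With ids that restart per generation, the second and third runs reuse the cached entry.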

  was:
Steps to reproduce:
 # Create a new table with an array-typed column: create table test_code_gen(a array<int>);
 # Add 
log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = 
DEBUG to log4j.properties;
 # Start spark-shell and run a query: spark.sql("select * from 
test_code_gen").collect
 # Every time Dataset.collect is called, a SpecificSafeProjection class is 
generated, but the generated code cannot be reused because the ids embedded in 
two variable names change on every call: MapObjects_loopValue and 
MapObjects_loopIsNull. So even though the previously generated class has been 
cached, the new code does not match the cache key and has to be compiled again, 
which costs time.  
!image-2021-05-24-16-15-18-359.png!!image-2021-05-24-16-05-34-334.png!
 # Compilation time grows with the number of columns; for a wide table it can 
exceed 2s. !image-2021-05-24-16-11-20-841.png!


> GenerateSafeProjection.generate generates a SpecificSafeProjection class, 
> but if a column is of array or map type, the code cannot be reused, which 
> impacts query performance
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-35500
>                 URL: https://issues.apache.org/jira/browse/SPARK-35500
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yahui Liu
>            Priority: Minor
>              Labels: codegen



