[jira] [Commented] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query performance

Takeshi Yamamuro (Jira) Mon, 24 May 2021 07:27:07 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350458#comment-17350458
 ]


Takeshi Yamamuro commented on SPARK-35500:
------------------------------------------

Which version did you use? v3.1.0 does not exist, so did you mean v3.1.1? I ran 
the queries on v3.1.1 to reproduce the issue, but it did not occur:
{code:java}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/
         
Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)

scala> sql("create table test_code_gen(a array<int>)")
scala> sql("insert into test_code_gen values (array(1, 1))")
scala> sc.setLogLevel("debug")

// The first run
scala> sql("select * from test_code_gen").collect()
...
21/05/24 23:14:00 DEBUG GenerateSafeProjection: code for 
createexternalrow(staticinvoke(class scala.collection.mutable.WrappedArray$, 
ObjectType(interface scala.collection.Seq), make, 
mapobjects(lambdavariable(MapObject, IntegerType, true, -1), 
lambdavariable(MapObject, IntegerType, true, -1), input[0, array<int>, true], 
None).array, true, false), StructField(a,ArrayType(IntegerType,true),true)):
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificSafeProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificSafeProjection extends 
org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private InternalRow mutableRow;
/* 009 */   private boolean resultIsNull_0;
/* 010 */   private int value_MapObject_lambda_variable_1;
/* 011 */   private boolean isNull_MapObject_lambda_variable_1;
/* 012 */   private boolean globalIsNull_0;
/* 013 */   private java.lang.Object[] mutableStateArray_0 = new 
java.lang.Object[1];
/* 014 */
...


// The second run
scala> sql("select * from test_code_gen").collect()
...
21/05/24 23:14:28 DEBUG GenerateSafeProjection: code for 
createexternalrow(staticinvoke(class scala.collection.mutable.WrappedArray$, 
ObjectType(interface scala.collection.Seq), make, 
mapobjects(lambdavariable(MapObject, IntegerType, true, -1), 
lambdavariable(MapObject, IntegerType, true, -1), input[0, array<int>, true], 
None).array, true, false), StructField(a,ArrayType(IntegerType,true),true)):
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificSafeProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificSafeProjection extends 
org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private InternalRow mutableRow;
/* 009 */   private boolean resultIsNull_0;
/* 010 */   private int value_MapObject_lambda_variable_1;
/* 011 */   private boolean isNull_MapObject_lambda_variable_1;
/* 012 */   private boolean globalIsNull_0;
/* 013 */   private java.lang.Object[] mutableStateArray_0 = new 
java.lang.Object[1];
...
 {code}
Actually, this issue should already have been fixed by 
SPARK-27871 ([https://github.com/apache/spark/pull/24735]). Or am I missing 
something?

> GenerateSafeProjection.generate will generate SpecificSafeProjection class, 
> but if column is array type or map type, the code cannot be reused which 
> impact the query performance
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-35500
>                 URL: https://issues.apache.org/jira/browse/SPARK-35500
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yahui Liu
>            Priority: Minor
>              Labels: codegen
>
> Reproduce steps:
>  # create a new table with array type: create table test_code_gen(a 
> array<int>);
>  # Add 
> log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator 
> = DEBUG to log4j.properties;
>  # Enter spark-shell, fire a query: spark.sql("select * from 
> test_code_gen").collect
>  # Every time Dataset.collect is called, the SpecificSafeProjection class is 
> generated, but the code for the class cannot be reused because the id of two 
> variables in the generated class changes on every run: MapObjects_loopValue 
> and MapObjects_loopIsNull. So even though the previously generated class has 
> been cached, the new code does not match the cache key and has to be 
> compiled again, which costs time. The compile time grows with the number of 
> columns; for a wide table, it can exceed 2s. 
> {code:java}
> object MapObjects {
>   private val curId = new java.util.concurrent.atomic.AtomicInteger()
>  val id = curId.getAndIncrement()
>  val loopValue = s"MapObjects_loopValue$id"
>  val loopIsNull = if (elementNullable) {
>    s"MapObjects_loopIsNull$id"
>  } else {
>    "false"
>  }
> {code}
> First time run: 
> {code:java}
> class SpecificSafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
>  private int MapObjects_loopValue1;
>  private boolean MapObjects_loopIsNull1;
>  private UTF8String MapObjects_loopValue2;
>  private boolean MapObjects_loopIsNull2;
> }
> {code}
> Second time run:
> {code:java}
> class SpecificSafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
>  private int MapObjects_loopValue3;
>  private boolean MapObjects_loopIsNull3;
>  private UTF8String MapObjects_loopValue4;
>  private boolean MapObjects_loopIsNull4;
> }{code}
> Expectation:
> The code generated by GenerateSafeProjection can be reused if the query is 
> the same.
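
The cache-miss behaviour described in the report can be sketched in isolation. The following is a minimal illustration, not Spark's actual code: it assumes the compiled-class cache is keyed by the generated source text, and the class CodegenCacheSketch and all of its methods are hypothetical names introduced only for this example. A process-wide counter makes every generated source differ, while a counter scoped to a single generation yields the same source each time.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; none of these names come from Spark.
class CodegenCacheSketch {
    // Process-wide counter, mirroring the curId pattern quoted in the report.
    private static final AtomicInteger GLOBAL_ID = new AtomicInteger();

    // Assumption: the cache of "compiled classes" is keyed by source text.
    private final Map<String, Object> cache = new HashMap<>();
    private int compileCount = 0;

    // Variable name depends on a global counter, so the text changes per call.
    static String generateWithGlobalId() {
        int id = GLOBAL_ID.getAndIncrement();
        return "private int MapObjects_loopValue" + id + ";";
    }

    // Variable name depends on a counter local to one generation,
    // so the same query always produces the same text.
    static String generateWithLocalId() {
        AtomicInteger localId = new AtomicInteger();
        return "private int MapObjects_loopValue" + localId.getAndIncrement() + ";";
    }

    // "Compiles" only when the source text has not been seen before.
    Object compile(String source) {
        return cache.computeIfAbsent(source, s -> {
            compileCount++;
            return new Object(); // stand-in for a compiled class
        });
    }

    int compileCount() {
        return compileCount;
    }

    public static void main(String[] args) {
        CodegenCacheSketch global = new CodegenCacheSketch();
        global.compile(generateWithGlobalId());
        global.compile(generateWithGlobalId());
        System.out.println("global ids, compilations: " + global.compileCount());

        CodegenCacheSketch local = new CodegenCacheSketch();
        local.compile(generateWithLocalId());
        local.compile(generateWithLocalId());
        System.out.println("local ids, compilations: " + local.compileCount());
    }
}
```

In this sketch the two runs with global ids compile twice, while the two runs with local ids compile once, which matches the identical variable names seen in both runs of the v3.1.1 log above.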



