Github user yucai commented on the issue: https://github.com/apache/spark/pull/22847 I used the WideTableBenchmark to test this configuration. 4 scenarioes are tested, `2048` is always better than `1024`, overall it is also good and looks more safe to avoid hitting 8KB limitaion. **Scenario** 1. projection on wide table: simple ``` val N = 1 << 20 val df = spark.range(N) val columns = (0 until 400).map{ i => s"id as id$i"} df.selectExpr(columns: _*).foreach(identity(_)) ``` 2. projection on wide table: long alias names ``` val longName = "averylongaliasname" * 20 val columns = (0 until 400).map{ i => s"id as ${longName}_id$i"} df.selectExpr(columns: _*).foreach(identity(_)) ``` 3. projection on wide table: many complex expressions ``` // 400 columns, whole stage codegen is disabled for spark.sql.codegen.maxFields val columns = (0 until 400).map{ i => s"case when id = $i then $i else 800 end as id$i"} df.selectExpr(columns: _*).foreach(identity(_)) ``` 4. projection on wide table: a big complex expressions ``` // Because of spark.sql.subexpressionElimination.enabled, // the whole case when codes will be put into one function, // and it will be invoked once only. val columns = (0 until 400).map{ i => s"case when id = ${N + 1} then 1 when id = ${N + 2} then 1 ... when id = ${N + 6} then 1 else sqrt(N) end as id$i"} df.selectExpr(columns: _*).foreach(identity(_)) ``` **Perf Results** ``` ================================================================================================ projection on wide table: simple ================================================================================================ Java HotSpot(TM) 64-Bit Server VM 1.8.0_162-b12 on Mac OS X 10.13.6 Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz projection on wide table: simple: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ split threshold 10 7553 / 7676 0.1 7202.7 1.0X split threshold 100 5463 / 5504 0.2 5210.0 1.4X split threshold 1024 2981 / 3017 0.4 2843.0 2.5X split threshold 2048 2857 / 2897 0.4 2724.2 2.6X split threshold 4096 3128 / 3187 0.3 2983.3 2.4X split threshold 8196 3755 / 3793 0.3 3581.3 2.0X split threshold 65536 27616 / 27685 0.0 26336.2 0.3X ================================================================================================ projection on wide table: long alias names ================================================================================================ Java HotSpot(TM) 64-Bit Server VM 1.8.0_162-b12 on Mac OS X 10.13.6 Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz projection on wide table: long alias names: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ split threshold 10 7513 / 7566 0.1 7164.6 1.0X split threshold 100 5363 / 5410 0.2 5114.4 1.4X split threshold 1024 2966 / 2998 0.4 2828.3 2.5X split threshold 2048 2840 / 2864 0.4 2708.0 2.6X split threshold 4096 3126 / 3166 0.3 2981.2 2.4X split threshold 8196 3756 / 3823 0.3 3582.3 2.0X split threshold 65536 27542 / 27729 0.0 26266.4 0.3X ================================================================================================ projection on wide table: many complex expressions ================================================================================================ Java HotSpot(TM) 64-Bit Server VM 1.8.0_162-b12 on Mac OS X 10.13.6 Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz projection on wide table: complex expressions 1: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ split threshold 10 8758 / 9007 0.1 8352.3 1.0X split threshold 100 8675 / 8754 0.1 8272.9 1.0X split threshold 1024 3696 / 3797 0.3 3524.7 2.4X split threshold 2048 3349 / 3419 0.3 3193.9 2.6X split threshold 4096 2967 / 3019 0.4 2829.1 3.0X split threshold 8196 3757 / 3841 0.3 3582.5 2.3X split threshold 65536 39805 / 40309 0.0 37960.7 0.2X ================================================================================================ projection on wide table: a big complex expressions ================================================================================================ Java HotSpot(TM) 64-Bit Server VM 1.8.0_162-b12 on Mac OS X 10.13.6 Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz projection on wide table: complex expressions 2: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ split threshold 10 6349 / 6523 0.2 6054.7 1.0X split threshold 100 4804 / 4912 0.2 4581.4 1.3X split threshold 1024 3145 / 3193 0.3 2999.4 2.0X split threshold 2048 3089 / 3124 0.3 2945.9 2.1X split threshold 4096 2987 / 3060 0.4 2848.3 2.1X split threshold 8196 3705 / 3718 0.3 3533.4 1.7X split threshold 65536 17102 / 17156 0.1 16310.1 0.4X ``` Complete benchmark source: [WideTableBenchmark.scala](https://github.com/yucai/spark/blob/betterSplitThreshold/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/WideTableBenchmark.scala) And I have another finding. In above testing, the wide table's columns are more than 100, so whole stage code gen is disabled, because of `spark.sql.codegen.maxFields`. If the wide table's columns are less than 100, whole stage code gen is enabled, but the expressions in `ProjectExec`'s could not be split. Like below: ``` /* 133 */ private void project_doConsume_0(long project_expr_0_0) throws java.io.IOException { /* 134 */ // CONSUME: WholeStageCodegen /* 135 */ // CASE WHEN (input[0, bigint, false] = 0) THEN 0 ELSE 800 END /* 136 */ byte project_caseWhenResultState_0 = -1; /* 137 */ do { /* 138 */ // (input[0, bigint, false] = 0) /* 139 */ boolean project_value_2 = false; /* 140 */ project_value_2 = project_expr_0_0 == 0L; /* 141 */ if (!false && project_value_2) { /* 142 */ project_caseWhenResultState_0 = (byte)(false ? 1 : 0); /* 143 */ project_project_value_1_0 = 0; /* 144 */ continue; ... ``` `project_doConsume_0` is almost 2000 lines, whose byte code are probably more than 8KB. Not sure if it is a known issue?
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org