dongjoon-hyun commented on a change in pull request #24599: [SPARK-27701][SQL]
Extend NestedColumnAliasing to general nested field cases including
GetArrayStructField
URL: https://github.com/apache/spark/pull/24599#discussion_r292545297
##########
File path:
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/NestedSchemaPruningBenchmark.scala
##########
@@ -33,13 +33,17 @@ abstract class NestedSchemaPruningBenchmark extends SqlBasedBenchmark {
protected val N = 1000000
protected val numIters = 10
- // We use `col1 BIGINT, col2 STRUCT<_1: BIGINT, _2: STRING>` as a test schema.
- // col1 and col2._1 is used for comparision. col2._2 mimics the burden for the other columns
+ // We use `col1 BIGINT, col2 STRUCT<_1: BIGINT, _2: STRING>,
+ // col3 ARRAY<STRUCT<_1: BIGINT, _2: STRING>>` as a test schema.
+ // col1, col2._1 and col3._1 are used for comparision. col2._2 and col3._2 mimics the burden
+ // for the other columns
private val df = spark
.range(N * 10)
.sample(false, 0.1)
- .map(x => (x, (x, s"$x" * 100)))
- .toDF("col1", "col2")
+ .map { x =>
+ val col3 = (0 until 10).map(i => (x + i, s"$x" * 10))
+ (x, (x, s"$x" * 100), col3)
+ }.toDF("col1", "col2", "col3")
Review comment:
The regression is on the master branch. Parquet looks good, but for ORC I
got the following.
```
Nested Schema Pruning Benchmark For ORC v1
================================================================================================
-OpenJDK 64-Bit Server VM 1.8.0_201-b09 on Linux 3.10.0-862.3.2.el7.x86_64
+OpenJDK 64-Bit Server VM 1.8.0_212-b04 on Linux 3.10.0-862.3.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Selection:                                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
-Top-level column                                   131            150          25          7.7         130.6       1.0X
-Nested column                                      922            954          21          1.1         922.2       0.1X
+Top-level column                                   125            162          23          8.0         125.5       1.0X
+Nested column                                     1210           1272          92          0.8        1209.9       0.1X
-OpenJDK 64-Bit Server VM 1.8.0_201-b09 on Linux 3.10.0-862.3.2.el7.x86_64
+OpenJDK 64-Bit Server VM 1.8.0_212-b04 on Linux 3.10.0-862.3.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Limiting:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
-Top-level column                                   446            477          50          2.2         445.5       1.0X
-Nested column                                     1328           1366          44          0.8        1328.4       0.3X
+Top-level column                                   432            468          30          2.3         432.1       1.0X
+Nested column                                     1632           1685          77          0.6        1631.5       0.3X
```
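To put a number on the ORC regression, here is a minimal sketch that computes the slowdown of the `Nested column` rows from the Best Time(ms) values above. It assumes the `-` lines are the previously committed results and the `+` lines are the rerun; `NestedBest` and `OrcRegression` are hypothetical names used only for this illustration and are not part of the benchmark code.

```scala
// Best Time(ms) for the "Nested column" rows, copied from the ORC v1 output above.
// Assumption: "-" lines are the earlier results, "+" lines are the rerun.
final case class NestedBest(benchmark: String, beforeMs: Double, afterMs: Double) {
  // A ratio above 1.0 means the rerun is slower than the earlier result.
  def slowdown: Double = afterMs / beforeMs
}

object OrcRegression {
  val rows: Seq[NestedBest] = Seq(
    NestedBest("Selection", 922.0, 1210.0),
    NestedBest("Limiting", 1328.0, 1632.0)
  )

  def main(args: Array[String]): Unit =
    rows.foreach { r =>
      println(f"${r.benchmark}%-10s ${r.beforeMs}%.0f ms -> ${r.afterMs}%.0f ms (${r.slowdown}%.2fx)")
    }
}
```

That works out to roughly 1.31x for Selection and 1.23x for Limiting, while the `Top-level column` rows stay about the same. The two runs also differ in JDK build (1.8.0_201 vs 1.8.0_212), so part of the gap may be environmental.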