viirya commented on a change in pull request #24599: [SPARK-27701][SQL] Extend
NestedColumnAliasing to general nested field cases including GetArrayStructField
URL: https://github.com/apache/spark/pull/24599#discussion_r292318954
##########
File path:
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/NestedSchemaPruningBenchmark.scala
##########
@@ -33,13 +33,17 @@ abstract class NestedSchemaPruningBenchmark extends SqlBasedBenchmark {
protected val N = 1000000
protected val numIters = 10
- // We use `col1 BIGINT, col2 STRUCT<_1: BIGINT, _2: STRING>` as a test schema.
- // col1 and col2._1 is used for comparision. col2._2 mimics the burden for the other columns
+ // We use `col1 BIGINT, col2 STRUCT<_1: BIGINT, _2: STRING>,
+ // col3 ARRAY<STRUCT<_1: BIGINT, _2: STRING>>` as a test schema.
+ // col1, col2._1 and col3._1 are used for comparision. col2._2 and col3._2 mimics the burden
+ // for the other columns
private val df = spark
.range(N * 10)
.sample(false, 0.1)
- .map(x => (x, (x, s"$x" * 100)))
- .toDF("col1", "col2")
+ .map { x =>
+ val col3 = (0 until 10).map(i => (x + i, s"$x" * 10))
+ (x, (x, s"$x" * 100), col3)
+ }.toDF("col1", "col2", "col3")
Review comment:
Thanks @dongjoon-hyun. I made the array less heavy.
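To make the size of the new array column concrete, here is a minimal plain-Scala sketch (no Spark required) of the row that the updated `.map { x => ... }` step in the diff builds. The names `RowShape` and `makeRow` are hypothetical and exist only for this illustration; the tuple construction itself is copied from the diff.

```scala
// Plain-Scala sketch of one generated row. `RowShape` and `makeRow` are
// illustrative names, not part of the benchmark.
object RowShape {
  type Nested = (Long, String)

  // Mirrors the generator in the diff:
  // col1 BIGINT, col2 STRUCT<_1, _2>, col3 ARRAY<STRUCT<_1, _2>>.
  def makeRow(x: Long): (Long, Nested, Seq[Nested]) = {
    val col3 = (0 until 10).map(i => (x + i, s"$x" * 10))
    (x, (x, s"$x" * 100), col3)
  }

  def main(args: Array[String]): Unit = {
    val (col1, col2, col3) = makeRow(12345L)
    println(s"col1 = $col1")
    println(s"col2._2 length = ${col2._2.length}")
    println(s"col3 size = ${col3.size}, total string length = ${col3.map(_._2.length).sum}")
  }
}
```

For `x = 12345`, `col2._2` is 500 characters long, and the ten `col3` strings are 50 characters each, so the array carries the same 500 characters of string payload in total.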
Btw, the test previously created a separate temp view (`t1`, `t2`) for each
benchmark case. The added case reused `t2`, but to stay consistent with the
existing cases, I have now created `t3` for it.
I also compared the query plans from before and after this PR. They are
the same locally.
Previous:
```
[info] == Physical Plan ==
[info] *(1) Project [col2#26._1 AS _1#239L]
[info] +- *(1) FileScan orc [col2#26] Batched: false, DataFilters: [], Format: ORC, Location: InMemoryFileIndex[file:/spark/target/tmp/spark-3e4433f7-c92e-4017-9efb-3bbd2..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint>>
```
This PR:
```
[info] == Physical Plan ==
[info] *(1) Project [col2#26._1 AS _1#231L]
[info] +- *(1) FileScan orc [col2#26] Batched: false, DataFilters: [], Format: ORC, Location: InMemoryFileIndex[file:/spark/target/tmp/spark-11a5c59b-5b37-4eff-8fdd-398c7..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint>>
```
The benchmark results for the existing cases are also essentially unchanged in a local run.
Previous:
```
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.5
[info] Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
[info] Selection:                                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] Top-level column                                     86            120          48         11.6          86.4       1.0X
[info] Nested column                                      1081           1097           9          0.9        1080.9       0.1X
```
This PR:
```
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.5
[info] Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
[info] Selection:                                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] Top-level column                                     80             89          10         12.5          79.9       1.0X
[info] Nested column                                      1038           1052           9          1.0        1038.4       0.1X
[info] Nested column in array                             2576           2603          18          0.4        2575.7       0.0X
```
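For reading the tables above, here is a small sketch of how the derived columns relate to the best time. It assumes each case processes `N = 1000000` rows (the value set in the diff) and that `Relative` is the first case's best time divided by the current case's best time. The object `BenchmarkColumns` and its methods are hypothetical helpers written for this note, not Spark's `Benchmark` implementation.

```scala
import java.util.Locale

// Illustrative helpers only; names and formulas are assumptions that happen
// to reproduce the figures in the tables above.
object BenchmarkColumns {
  // Rows per case; the benchmark in the diff sets N = 1000000.
  val N: Long = 1000000L

  // Nanoseconds spent per row, given the best time in milliseconds.
  def perRowNs(bestMs: Double, rows: Long = N): Double = bestMs * 1e6 / rows

  // Throughput in millions of rows per second.
  def rateMPerSec(bestMs: Double, rows: Long = N): Double = rows / (bestMs * 1000.0)

  // Speed relative to the baseline (first) case, rounded to one decimal.
  def relative(baselineMs: Double, bestMs: Double): String =
    String.format(Locale.ROOT, "%.1fX", Double.box(baselineMs / bestMs))

  def main(args: Array[String]): Unit = {
    // Unrounded best times taken from the "Per Row(ns)" column of the
    // "This PR" table (equal to milliseconds when N = 1000000).
    val topLevelMs = 79.9
    val nestedInArrayMs = 2575.7
    println(f"rate = ${rateMPerSec(topLevelMs)}%.1f M/s")
    println(s"relative = ${relative(topLevelMs, nestedInArrayMs)}")
  }
}
```

Under these assumptions, the `0.0X` shown for the array case is a rounding artifact: 79.9 / 2575.7 is roughly 0.03, so the array case is about 32 times slower than the top-level column, not infinitely slower.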