mcdull-zhang opened a new pull request, #38703:
URL: https://github.com/apache/spark/pull/38703
### What changes were proposed in this pull request?
For example, consider the following statements:
```sql
cache table t1 as select a from testData3 group by a;
cache table t2 as select a,b from testData2 where a in (select a from t1);
select key,value,b from testData t3 join t2 on t3.key=t2.a;
```
The third statement does not use the cached `t2`.
Plan before this PR:
```tex
Project [key#13, value#14, b#24]
+- SortMergeJoin [key#13], [a#23], Inner
   :- BroadcastHashJoin [key#13], [a#359], LeftSemi, BuildRight, false
   :  :- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false, true) AS value#14]
   :  :  +- Scan[obj#12]
   :  +- Scan In-memory table t1 [a#359]
   :        +- InMemoryRelation [a#359], StorageLevel(disk, memory, deserialized, 1 replicas)
   :              +- *(2) HashAggregate(keys=[a#33], functions=[], output=[a#33])
   :                 +- Exchange hashpartitioning(a#33, 5), ENSURE_REQUIREMENTS, [plan_id=92]
   :                    +- *(1) HashAggregate(keys=[a#33], functions=[], output=[a#33])
   :                       +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).a AS a#33]
   :                          +- Scan[obj#32]
   +- BroadcastHashJoin [a#23], [a#359], LeftSemi, BuildRight, false
      :- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#23, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#24]
      :  +- Scan[obj#22]
      +- Scan In-memory table t1 [a#359]
            +- InMemoryRelation [a#359], StorageLevel(disk, memory, deserialized, 1 replicas)
                  +- *(2) HashAggregate(keys=[a#33], functions=[], output=[a#33])
                     +- Exchange hashpartitioning(a#33, 5), ENSURE_REQUIREMENTS, [plan_id=92]
                        +- *(1) HashAggregate(keys=[a#33], functions=[], output=[a#33])
                           +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).a AS a#33]
                              +- Scan[obj#32]
```
Plan after this PR (the join now scans the cached `t2`):
```tex
Project [key#13, value#14, b#358]
+- BroadcastHashJoin [key#13], [a#357], Inner, BuildRight, false
   :- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false, true) AS value#14]
   :  +- Scan[obj#12]
   +- Scan In-memory table t2 [a#357, b#358]
         +- InMemoryRelation [a#357, b#358], StorageLevel(disk, memory, deserialized, 1 replicas)
               +- *(1) BroadcastHashJoin [a#23], [a#261], LeftSemi, BuildRight, false
                  :- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#23, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#24]
                  :  +- Scan[obj#22]
                  +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [plan_id=155]
                     +- Scan In-memory table t1 [a#261]
                           +- InMemoryRelation [a#261], StorageLevel(disk, memory, deserialized, 1 replicas)
                                 +- *(2) HashAggregate(keys=[a#33], functions=[], output=[a#33])
                                    +- Exchange hashpartitioning(a#33, 5), ENSURE_REQUIREMENTS, [plan_id=92]
                                       +- *(1) HashAggregate(keys=[a#33], functions=[], output=[a#33])
                                          +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).a AS a#33]
                                             +- Scan[obj#32]
```
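One quick way to tell the two plans apart is to look for the `Scan In-memory table` nodes in the plan text. The sketch below is only an illustration and is not the test added in this PR: the helper names `cached_tables_scanned` and `uses_cached_table` are hypothetical, and the plan strings are abbreviated excerpts of the plans shown above.

```python
import re
from typing import List

# Abbreviated excerpts of the plans above (operator details elided for brevity).
BEFORE_PLAN = """
Project [key#13, value#14, b#24]
+- SortMergeJoin [key#13], [a#23], Inner
   :- BroadcastHashJoin [key#13], [a#359], LeftSemi, BuildRight, false
   :  +- Scan In-memory table t1 [a#359]
   +- BroadcastHashJoin [a#23], [a#359], LeftSemi, BuildRight, false
      +- Scan In-memory table t1 [a#359]
"""

AFTER_PLAN = """
Project [key#13, value#14, b#358]
+- BroadcastHashJoin [key#13], [a#357], Inner, BuildRight, false
   +- Scan In-memory table t2 [a#357, b#358]
         +- InMemoryRelation [a#357, b#358], StorageLevel(disk, memory, deserialized, 1 replicas)
               +- Scan In-memory table t1 [a#261]
"""


def cached_tables_scanned(plan_text: str) -> List[str]:
    """Return the table names of all in-memory scans, in order of appearance.

    Scans nested under an InMemoryRelation (the plan that built the cache)
    are included too, because this is a plain text match.
    """
    return re.findall(r"Scan In-memory table (\S+)", plan_text)


def uses_cached_table(plan_text: str, table: str) -> bool:
    """Hypothetical helper: True if the plan text scans the cached table."""
    return table in cached_tables_scanned(plan_text)


if __name__ == "__main__":
    print("before:", cached_tables_scanned(BEFORE_PLAN))
    print("after: ", cached_tables_scanned(AFTER_PLAN))
```

In the "before" excerpt only `t1` is scanned (twice), whereas the "after" excerpt scans `t2`; the `t1` scan that still appears there belongs to the plan that originally built the `t2` cache.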
### Why are the changes needed?
Performance improvement: the query reads the cached `t2` instead of recomputing the query that defines it.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a test.