[GitHub] [spark] mcdull-zhang opened a new pull request, #38703: [SPARK-41191] [SQL] Cache Table is not working while nested caches exist

GitBox Thu, 17 Nov 2022 20:19:28 -0800


mcdull-zhang opened a new pull request, #38703:
URL: https://github.com/apache/spark/pull/38703


   ### What changes were proposed in this pull request?
   For example, consider the following statements:
   ```sql
   cache table t1 as select a from testData3 group by a;
   cache table t2 as select a,b from testData2 where a in (select a from t1);
   select key,value,b from testData t3 join t2 on t3.key=t2.a;
   ```
   The third statement does not use the cached t2. Instead, it recomputes the semi join against t1, as the plan before this PR shows.
   
   Before this PR:
   ```tex
   Project [key#13, value#14, b#24]
   +- SortMergeJoin [key#13], [a#23], Inner
      :- BroadcastHashJoin [key#13], [a#359], LeftSemi, BuildRight, false
      :  :- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false, true) AS value#14]
      :  :  +- Scan[obj#12]
      :  +- Scan In-memory table t1 [a#359]
      :        +- InMemoryRelation [a#359], StorageLevel(disk, memory, deserialized, 1 replicas)
      :              +- *(2) HashAggregate(keys=[a#33], functions=[], output=[a#33])
      :                 +- Exchange hashpartitioning(a#33, 5), ENSURE_REQUIREMENTS, [plan_id=92]
      :                    +- *(1) HashAggregate(keys=[a#33], functions=[], output=[a#33])
      :                       +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).a AS a#33]
      :                          +- Scan[obj#32]
      +- BroadcastHashJoin [a#23], [a#359], LeftSemi, BuildRight, false
         :- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#23, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#24]
         :  +- Scan[obj#22]
         +- Scan In-memory table t1 [a#359]
               +- InMemoryRelation [a#359], StorageLevel(disk, memory, deserialized, 1 replicas)
                     +- *(2) HashAggregate(keys=[a#33], functions=[], output=[a#33])
                        +- Exchange hashpartitioning(a#33, 5), ENSURE_REQUIREMENTS, [plan_id=92]
                           +- *(1) HashAggregate(keys=[a#33], functions=[], output=[a#33])
                              +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).a AS a#33]
                                 +- Scan[obj#32]
   ```
   
   After this PR:
   ```tex
   Project [key#13, value#14, b#358]
   +- BroadcastHashJoin [key#13], [a#357], Inner, BuildRight, false
      :- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false, true) AS value#14]
      :  +- Scan[obj#12]
      +- Scan In-memory table t2 [a#357, b#358]
            +- InMemoryRelation [a#357, b#358], StorageLevel(disk, memory, deserialized, 1 replicas)
                  +- *(1) BroadcastHashJoin [a#23], [a#261], LeftSemi, BuildRight, false
                     :- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#23, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#24]
                     :  +- Scan[obj#22]
                     +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [plan_id=155]
                        +- Scan In-memory table t1 [a#261]
                              +- InMemoryRelation [a#261], StorageLevel(disk, memory, deserialized, 1 replicas)
                                    +- *(2) HashAggregate(keys=[a#33], functions=[], output=[a#33])
                                       +- Exchange hashpartitioning(a#33, 5), ENSURE_REQUIREMENTS, [plan_id=92]
                                          +- *(1) HashAggregate(keys=[a#33], functions=[], output=[a#33])
                                             +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).a AS a#33]
                                                +- Scan[obj#32]
   ```
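   
   One way to check locally whether the nested cache is picked up is to run `EXPLAIN` on the final query and look for a `Scan In-memory table t2` node. This is only a sketch: it assumes the views `testData`, `testData2` and `testData3` (from Spark's SQL test fixtures) already exist in the session, and the exact plan text may differ between Spark versions and configurations.
   ```sql
   -- Assumption: testData, testData2 and testData3 are already registered in this session.
   cache table t1 as select a from testData3 group by a;
   cache table t2 as select a,b from testData2 where a in (select a from t1);

   -- Print the physical plan instead of running the query.
   -- If the cached t2 is used, the plan should contain "Scan In-memory table t2".
   explain select key,value,b from testData t3 join t2 on t3.key=t2.a;

   -- Clean up the caches created above.
   uncache table if exists t2;
   uncache table if exists t1;
   ```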
   
   
   ### Why are the changes needed?
   Performance improvement: queries such as the third statement above can read the cached t2 instead of recomputing it.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Added a test.
   


