sujith71955 removed a comment on issue #22758: [SPARK-25332][SQL] Selecting 
broadcast join for the small size table even query fired via new 
session/context 
URL: https://github.com/apache/spark/pull/22758#issuecomment-431229089
 
 
   @cloud-fan Please find my understanding of the flow below. It is a bit tricky :) but an interesting flow. Let me elaborate it step by step; maybe we will get more suggestions.
   
   
   Step 1: Run the insert command
   
   insert command --> In the DetermineTableStats rule the condition is not met, since the plan is a OneRowRelation in HiveStrategies (so the CatalogTable stats are None) --> In the HiveMetastoreCatalog flow, convertToLogicalRelation() caches the relation with None stats --> In the InsertIntoHadoopFsRelationCommand flow, HiveStats are not set since the CatalogTable stats are None.
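   The condition described above can be sketched with a minimal Java model (the class, method, and field names here are illustrative stand-ins, not Spark's actual API): default stats are stamped only onto a HiveTableRelation with no stats, so the insert over a OneRowRelation leaves the stats empty.

```java
import java.util.Optional;

// Simplified stand-in for the DetermineTableStats condition described above.
public class DetermineTableStatsModel {
    // Stand-in for the configured default size (treated as "very large").
    static final long DEFAULT_SIZE = Long.MAX_VALUE;

    // Default stats are stamped only when the node is a HiveTableRelation
    // that has no stats yet; any other plan is left untouched.
    static Optional<Long> applyRule(boolean isHiveTableRelation,
                                    Optional<Long> catalogStats) {
        if (isHiveTableRelation && catalogStats.isEmpty()) {
            return Optional.of(DEFAULT_SIZE);
        }
        return catalogStats;
    }

    public static void main(String[] args) {
        // Insert command: the node seen is a OneRowRelation, not a
        // HiveTableRelation, so the condition is not met and the relation
        // is cached with empty (None) stats.
        System.out.println(applyRule(false, Optional.empty())); // prints Optional.empty
    }
}
```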
   
   Step 2: Execute a select statement
   
   Select flow --> In the DetermineTableStats rule we get the relation as a HiveTableRelation, so the condition is met and the flow updates the stats with the default value --> In convertToLogicalRelation(), the LogicalRelation is fetched from the cache with stats as None (set in the insert command flow) --> In LogicalRelation computeStatistics, since the CatalogTable stats are None, the stats are calculated from the HadoopFsRelation based on the file source.
   
   Result: we get a plan with a broadcast join, as expected.
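   The precedence in the stats computation can be sketched as a small Java model (names and the 10 MB threshold, the default of spark.sql.autoBroadcastJoinThreshold, are illustrative; this is not Spark's actual code): CatalogTable stats win if present, otherwise the size comes from the files, and the planner broadcasts when the size is under the threshold.

```java
import java.util.Optional;

// Stand-in for the computeStatistics precedence described above.
public class ComputeStatsModel {
    // Illustrative 10 MB broadcast threshold.
    static final long BROADCAST_THRESHOLD = 10L * 1024 * 1024;

    // CatalogTable stats take precedence; otherwise fall back to the
    // size derived from the files on disk.
    static long computeSize(Optional<Long> catalogStats, long fileSourceSize) {
        return catalogStats.orElse(fileSourceSize);
    }

    static String chooseJoin(long sizeInBytes) {
        return sizeInBytes <= BROADCAST_THRESHOLD ? "BroadcastHashJoin" : "SortMergeJoin";
    }

    public static void main(String[] args) {
        // Step 2: the cached relation still has None stats, so the small
        // file-based size is used and the broadcast threshold is met.
        long size = computeSize(Optional.empty(), 1024L);
        System.out.println(chooseJoin(size)); // prints BroadcastHashJoin
    }
}
```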
   
   Step 3:
   
   Exit (come out of spark-shell).
   
   Step 4: Execute a select statement
   
   Select flow --> In the DetermineTableStats rule we get the relation as a HiveTableRelation, so the condition is met and the flow updates the stats with the default value --> The LogicalRelation cache is not available now, since this is a new context, so in the convertToLogicalRelation() flow the CatalogTable with the updated stats (set by DetermineTableStats) is added to the cache --> While computing the LogicalRelation stats, since no HiveStats are available, the default stats of the LogicalRelation (set by DetermineTableStats) are used.
   
   Result: we get a plan with a sort-merge join, which is wrong.
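   Under the same kind of simplified model (illustrative names only, not Spark's actual code), the new-session path goes wrong because the default stats are stamped into the CatalogTable before the relation is cached, so the small file-based size is never consulted:

```java
import java.util.Optional;

// Stand-in for the fresh-session select flow described above: the cache is
// empty, so the CatalogTable carries the default stats stamped by the
// DetermineTableStats-like rule, and those win over the small file size.
public class NewSessionModel {
    static final long DEFAULT_SIZE = Long.MAX_VALUE;
    static final long BROADCAST_THRESHOLD = 10L * 1024 * 1024; // illustrative 10 MB

    // Same precedence as before: catalog stats, if present, win.
    static long computeSize(Optional<Long> catalogStats, long fileSourceSize) {
        return catalogStats.orElse(fileSourceSize);
    }

    public static void main(String[] args) {
        // The default stats were already stamped, so the file-based size
        // (1024 bytes, a small table) is ignored.
        long size = computeSize(Optional.of(DEFAULT_SIZE), 1024L);
        // prints SortMergeJoin — the wrong plan for a small table
        System.out.println(size <= BROADCAST_THRESHOLD ? "BroadcastHashJoin" : "SortMergeJoin");
    }
}
```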
   
   Step 5: Run the insert command again in the same session
   
   insert command --> In the DetermineTableStats rule the condition is not met, since the plan is a OneRowRelation in HiveStrategies (so the CatalogTable stats are None) --> In HiveMetastoreCatalog, convertToLogicalRelation(), the relation is already present in the cache with the default value (set in the select flow of the previous step) --> This time HiveStats are recorded in the InsertIntoHadoopFsRelationCommand flow, since the CatalogTable stats are not None (the default value was already set in the previous step).
   
   Step 6:
   
   Exit (come out of spark-shell).
   
   Step 7:
   
   Run a select --> In DetermineTableStats we get the relation as a HiveTableRelation, so the condition is met and the flow updates the stats with the default value --> No LogicalRelation cache is available now, so per the convertToLogicalRelation() flow the CatalogTable with the updated stats (set by DetermineTableStats) is added to the cache --> While computing the LogicalRelation stats, since HiveStats are available (set by the insert command in step 5), the HiveStats are used.
   Result: this time we get a plan with a broadcast join, as expected.
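   The session-dependent behaviour of the insert flow can be sketched with one more model (illustrative names, not Spark's actual code): whether real HiveStats get recorded depends entirely on whether an earlier select already stamped defaults into the CatalogTable, which is the inconsistency.

```java
import java.util.Optional;

// Stand-in for the InsertIntoHadoopFsRelationCommand behaviour described
// above: fresh HiveStats are written only when the CatalogTable stats are
// already non-empty, so the outcome depends on what ran earlier.
public class InsertStatsModel {
    static Optional<Long> afterInsert(Optional<Long> catalogStats, long actualBytes) {
        return catalogStats.isPresent() ? Optional.of(actualBytes) : catalogStats;
    }

    public static void main(String[] args) {
        // Step 1: no prior stats -> nothing is recorded.
        System.out.println(afterInsert(Optional.empty(), 1024L));            // Optional.empty
        // Step 5: defaults left by step 4's select -> real stats are recorded.
        System.out.println(afterInsert(Optional.of(Long.MAX_VALUE), 1024L)); // Optional[1024]
    }
}
```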
   
   But our expectation is to always estimate the data size from the files for convertible relations, which is not happening now.
   So overall I can see an inconsistency in the insert flow.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
