Github user sujith71955 commented on the issue: https://github.com/apache/spark/pull/22758

@cloud-fan Please find my understanding of the flow below; it's a bit tricky :) Let me elaborate the flow so we can get more suggestions.

Step 1: Run the insert command
insert command --> In DetermineTableStats the condition is not met since it is a OneTableRelation in HiveStrategies (so CatalogTable stats is None) --> In convertToLogicalRelation(), the relation is cached with None stats, as per the HiveMetastoreCatalog flow --> Hive stats are not recorded since CatalogTable stats is None, as per the InsertIntoHadoopFsRelationCommand flow.

Step 2: Execute a select statement
select --> In DetermineTableStats the condition is met since it is a HiveTableRelation, so the stats are updated with the default value --> In convertToLogicalRelation(), the LogicalRelation is fetched from the cache with stats as None (set in the insert command flow) --> While computing the LogicalRelation stats, since CatalogTable stats is None, the stats are calculated from the HadoopFsRelation based on the file source.
Result: we get a plan with a broadcast join, as expected.

Step 3: exit; (come out of spark-shell)

Step 4: Execute a select statement
select --> In DetermineTableStats the condition is met since it is a HiveTableRelation, so the stats are updated with the default value --> Now no cached LogicalRelation is available, so per the convertToLogicalRelation() flow the CatalogTable with the updated stats (set by DetermineTableStats) is added to the cache --> While computing the LogicalRelation stats, since no Hive stats are available, the default stats of the LogicalRelation (set by DetermineTableStats) are used.
Result: we get a plan with a sort-merge join, which is wrong.
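The stats-selection order that produces the different plans in steps 2 and 4 can be sketched as below. This is an illustrative model, not the real Spark internals; `CatalogStats`, `Relation`, and `effectiveSize` are hypothetical names standing in for `CatalogTable.stats` and the `LogicalRelation` stats computation.

```scala
// Hypothetical sketch of the stats-selection order described above.
case class CatalogStats(sizeInBytes: BigInt)

case class Relation(
    catalogStats: Option[CatalogStats], // CatalogTable stats, may be None
    fileSourceSize: BigInt)             // size computed from the HadoopFsRelation files

// Stands in for spark.sql.defaultSizeInBytes (the huge default).
val defaultSizeInBytes: BigInt = BigInt(Long.MaxValue)

// Mirrors the behaviour described: catalog stats win when present;
// otherwise the file-source size is used.
def effectiveSize(r: Relation): BigInt =
  r.catalogStats.map(_.sizeInBytes).getOrElse(r.fileSourceSize)

// Step 2: CatalogTable stats are None (set by the insert flow), so the
// real file size is used -> small enough to broadcast.
val step2Size = effectiveSize(Relation(None, fileSourceSize = BigInt(1024)))

// Step 4: DetermineTableStats has injected the default size into the
// CatalogTable, so the huge default wins -> sort-merge join.
val step4Size = effectiveSize(
  Relation(Some(CatalogStats(defaultSizeInBytes)), fileSourceSize = BigInt(1024)))
```

With the small file size in step 2 and the huge default in step 4, the planner lands on opposite sides of the broadcast threshold even though the table on disk never changed.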
Step 5: Run the insert command again in the same session
insert command --> In DetermineTableStats the condition is not met since it is a OneTableRelation in HiveStrategies (so CatalogTable stats is None) --> In convertToLogicalRelation(), the relation is already present in the cache with the default value (set in the select flow of the previous step), as per the HiveMetastoreCatalog flow --> This time Hive stats are recorded, since CatalogTable stats is not None (the default value was already set), as per the InsertIntoHadoopFsRelationCommand flow.

Step 6: exit; (come out of spark-shell)

Step 7: Run a select statement
select --> In DetermineTableStats the condition is met since it is a HiveTableRelation, so the stats are updated with the default value --> Now no cached LogicalRelation is available, so per the convertToLogicalRelation() flow the CatalogTable with the updated stats (set by DetermineTableStats) is added to the cache --> While computing the LogicalRelation stats, since Hive stats are available (recorded by the insert command in step 5), the Hive stats are used.
Result: this time we get a plan with a broadcast join, as expected.

So overall I can see an inconsistency in the flow, caused by the DetermineTableStats rule setting default stats in the CatalogTable even for convertible relations.
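The cross-session interaction in steps 4-7 can be replayed with a small toy model. Everything here is illustrative (`StatsFlowModel`, `select`, `insert`, `newSession` are made-up names); it only models the behaviour described above: a per-session relation cache, metastore stats that survive sessions, and an insert path that records Hive stats only when the CatalogTable already carries stats.

```scala
import scala.collection.mutable

// Toy model of the per-session cache and metastore stats; not real Spark code.
object StatsFlowModel {
  val defaultSize: BigInt = BigInt(Long.MaxValue) // stands in for spark.sql.defaultSizeInBytes

  // Stats persisted in the metastore; survive across sessions.
  var hiveStats: Option[BigInt] = None

  // Per-session cache: table -> catalog stats captured at first resolution.
  val cache = mutable.Map.empty[String, Option[BigInt]]

  def newSession(): Unit = cache.clear()

  // SELECT path: DetermineTableStats fills in the default when no Hive stats
  // exist, and the first resolution in a session caches that CatalogTable.
  def select(table: String): BigInt = {
    val cached = cache.getOrElseUpdate(table, hiveStats.orElse(Some(defaultSize)))
    cached.getOrElse(defaultSize)
  }

  // INSERT path: Hive stats are recorded only when the cached CatalogTable
  // already carries stats (here, the default injected by an earlier SELECT).
  def insert(table: String, actualSize: BigInt): Unit =
    if (cache.get(table).exists(_.isDefined)) hiveStats = Some(actualSize)
}

// Replaying steps 4-7:
StatsFlowModel.newSession()                    // fresh spark-shell
val step4 = StatsFlowModel.select("t")         // default size -> sort-merge join
StatsFlowModel.insert("t", BigInt(1024))       // step 5: Hive stats get recorded
StatsFlowModel.newSession()                    // step 6: exit spark-shell
val step7 = StatsFlowModel.select("t")         // Hive stats win -> broadcast join
```

Running the same `select` in steps 4 and 7 returns different sizes purely because of what the intervening insert observed in the cache, which is the inconsistency described above.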