sujith71955 removed a comment on issue #22758: [SPARK-25332][SQL] Selecting broadcast join for the small size table even query fired via new session/context URL: https://github.com/apache/spark/pull/22758#issuecomment-431229089 @cloud-fan Please find my understanding of the flow below. It is a bit tricky :) but an interesting flow, so let me elaborate it step by step; maybe we get more suggestions.

Step 1: run the insert command.
- In the DetermineTableStats rule (HiveStrategies), the condition is not met because the child is a OneRowRelation, so the CatalogTable stats remain None.
- In the HiveMetastoreCatalog flow, convertToLogicalRelation() caches the relation with stats = None.
- In the InsertIntoHadoopFsRelationCommand flow, Hive stats are not recorded because the CatalogTable stats are None.

Step 2: run a select statement.
- In the DetermineTableStats rule the relation is a HiveTableRelation, so the condition is met and the flow updates the stats with the default value.
- In convertToLogicalRelation(), the LogicalRelation is fetched from the cache with stats = None (set during the insert flow).
- When the LogicalRelation computes statistics, the CatalogTable stats are None, so the size is estimated from the HadoopFsRelation based on the file source.
- Result: we get a plan with a broadcast join, as expected.

Step 3: exit (come out of spark-shell).

Step 4: run the select statement again (new session).
- In the DetermineTableStats rule the relation is a HiveTableRelation, so the condition is met and the flow updates the stats with the default value.
- The LogicalRelation cache is no longer available since this is a new context, so in the convertToLogicalRelation() flow the CatalogTable with the updated stats (set by DetermineTableStats) is added to the cache.
- When computing the LogicalRelation stats, no Hive stats are available, so the default stats of the LogicalRelation (set by DetermineTableStats) are used.
- Result: we get a plan with a sort-merge join, which is wrong.
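To make the size fallback in steps 2 and 4 concrete, here is a minimal plain-Java model (the class and method names are illustrative, not Spark's actual code): DetermineTableStats fills in the session default size when no catalog stats exist, and the relation's size estimate prefers catalog (Hive) stats over the file-based size.

```java
import java.util.Optional;

// Simplified model of the two decisions described above (not actual Spark source).
class StatsModel {
    // DetermineTableStats: if the relation carries no catalog stats, fill in the
    // session default size (effectively "infinite", so it never broadcasts).
    static Optional<Long> determineTableStats(Optional<Long> catalogStats, long defaultSizeInBytes) {
        return catalogStats.isPresent() ? catalogStats : Optional.of(defaultSizeInBytes);
    }

    // LogicalRelation statistics: prefer catalog (Hive) stats if present;
    // otherwise fall back to the file-based size from the HadoopFsRelation.
    static long computeStats(Optional<Long> cachedCatalogStats, long fileSizeInBytes) {
        return cachedCatalogStats.orElse(fileSizeInBytes);
    }
}
```

In step 2 the cached relation still has stats = None, so `computeStats` falls through to the small file-based size (broadcast). In step 4 the cache is rebuilt from the CatalogTable that DetermineTableStats already stamped with the huge default, so that default wins (sort-merge join).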
Step 5: run the insert command again in the same session.
- In the DetermineTableStats rule, the condition is not met because the child is a OneRowRelation, so the CatalogTable stats remain None.
- In HiveMetastoreCatalog, convertToLogicalRelation() finds the relation already cached with the default value (set during the select flow of the previous step).
- This time Hive stats are recorded in the InsertIntoHadoopFsRelationCommand flow, because the CatalogTable stats are not None (they carry the default value already set in the previous step).

Step 6: exit (come out of spark-shell).

Step 7: run the select statement again.
- In the DetermineTableStats rule the relation is a HiveTableRelation, so the condition is met and the flow updates the stats with the default value.
- No LogicalRelation cache is available now, so in the convertToLogicalRelation() flow the CatalogTable with the updated stats (set by DetermineTableStats) is added to the cache.
- When computing the LogicalRelation stats, Hive stats are available (recorded by the insert command in step 5), so they are used.
- Result: this time we get a plan with a broadcast join, as expected.

But our expectation is to always estimate the data size from the files for convertible relations, which is not happening now. So overall I see an inconsistency in the insert flow.
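The reason the size estimate flips the physical plan can be sketched as follows; this is a hypothetical helper (not Spark's planner code) illustrating the broadcast decision, assuming the default spark.sql.autoBroadcastJoinThreshold of 10 MB.

```java
// Illustrative model: a join side is broadcast only when its estimated size is
// at or below spark.sql.autoBroadcastJoinThreshold (default 10 MB).
class BroadcastDecision {
    static final long DEFAULT_THRESHOLD = 10L * 1024 * 1024;

    static boolean canBroadcast(long sizeInBytes) {
        return sizeInBytes >= 0 && sizeInBytes <= DEFAULT_THRESHOLD;
    }
}
```

With the file-based estimate (a few KB for the small table) the side is broadcastable; with the default size (Long.MaxValue) it is not, so the planner falls back to a sort-merge join. That is why the same query alternates between the two plans across steps 2, 4, and 7.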
