Thanks for the response, Quanlong. The behaviour you describe is broadcast join (versus partitioned / shuffle) - sorry for confusing usage of terms! Take a look at the differences in the cost model for the two (in lieu of better description): https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java#L444-L503
A partial summary for a shuffle join would be: Operator #Hosts ------------------------ 05:EXCHANGE 1 02:HASH JOIN 2 |--04:EXCHANGE 2 | 01:SCAN HDFS 2 03:EXCHANGE 2 00:SCAN HDFS 2 Notice Exchanges on both sides. On 12 February 2018 at 12:51, Quanlong Huang <huang_quanl...@126.com> wrote: > IMU, the left side is always located with the hash join node. If the stats > are correct, the left side will always be a larger table/input. There're > two terminologies in the hash join algorithm: build and probe. The smaller > table that can be built into an in-memory hash table is called the "build" > input. It's represented at the right side. After the in-memory hash table > is built, the larger table will be scanned and rows will be probed in the > hash table to find matched results. The larger table is called the "probe" > input and represented at the left side.So not all rows are sent across the > network to perform a hash join. Usually the larger table is scanned > locally. Network traffic comes from the "build" input. It's smaller and > sometimes can even be represented as a BloomFilter (one kind of > RuntimeFilter in Impala). > > However, there's still one case that all rows are sent across the network > anyway. That is when all tables are not located in the Impala cluster (e.g. > Impala is deployed in a portion of the Hadoop cluster). Scanning the tables > both consumes network traffic. However, when performing hash join, the > results of the right side will be sent to the left side, since they have > smaller size and consumes less network traffic than sending the left side. > > > I find this paper in "Impala Reading List" has much more details and > deserves to be read more times: > Hash joins and hash teams in Microsoft SQL Server (Graefe, Bunker, Cooper) > > > HTH > > > At 2018-02-12 18:13:09, "Jeszy" <jes...@gmail.com> wrote: > >IIUC, every row scanned in a partitioned hash join (both sides) is sent > >across the network (an exchange on HASH(key)). The targets of this > exchange > >are nodes that have data locality with the left side of the join. Why does > >Impala do it that way? > > > >Since all rows are sent across the network anyway, Impala could just use > >all the nodes in the cluster. The upside would be better parallelism for > >the join itself as well as for all the operators sitting on top of it. Is > >there a downside I'm forgetting? > >If not, is there a jira tracking this already? Haven't found one. > > > >Thanks! >