Thanks for the response, Quanlong. The behaviour you describe is broadcast
join (versus partitioned / shuffle) - sorry for confusing usage of terms!
Take a look at the differences in the cost model for the two (in lieu of
A partial summary for a shuffle join would be:
02:HASH JOIN 2
| 01:SCAN HDFS 2
00:SCAN HDFS 2
Notice Exchanges on both sides.
On 12 February 2018 at 12:51, Quanlong Huang <huang_quanl...@126.com> wrote:
> IMU, the left side is always located with the hash join node. If the stats
> are correct, the left side will always be a larger table/input. There're
> two terminologies in the hash join algorithm: build and probe. The smaller
> table that can be built into an in-memory hash table is called the "build"
> input. It's represented at the right side. After the in-memory hash table
> is built, the larger table will be scanned and rows will be probed in the
> hash table to find matched results. The larger table is called the "probe"
> input and represented at the left side.So not all rows are sent across the
> network to perform a hash join. Usually the larger table is scanned
> locally. Network traffic comes from the "build" input. It's smaller and
> sometimes can even be represented as a BloomFilter (one kind of
> RuntimeFilter in Impala).
> However, there's still one case that all rows are sent across the network
> anyway. That is when all tables are not located in the Impala cluster (e.g.
> Impala is deployed in a portion of the Hadoop cluster). Scanning the tables
> both consumes network traffic. However, when performing hash join, the
> results of the right side will be sent to the left side, since they have
> smaller size and consumes less network traffic than sending the left side.
> I find this paper in "Impala Reading List" has much more details and
> deserves to be read more times:
> Hash joins and hash teams in Microsoft SQL Server (Graefe, Bunker, Cooper)
> At 2018-02-12 18:13:09, "Jeszy" <jes...@gmail.com> wrote:
> >IIUC, every row scanned in a partitioned hash join (both sides) is sent
> >across the network (an exchange on HASH(key)). The targets of this
> >are nodes that have data locality with the left side of the join. Why does
> >Impala do it that way?
> >Since all rows are sent across the network anyway, Impala could just use
> >all the nodes in the cluster. The upside would be better parallelism for
> >the join itself as well as for all the operators sitting on top of it. Is
> >there a downside I'm forgetting?
> >If not, is there a jira tracking this already? Haven't found one.