IIUC, every row scanned in a partitioned hash join (both sides) is sent across the network (an exchange on HASH(key)). The targets of this exchange are nodes that have data locality with the left side of the join. Why does Impala do it that way?
Since all rows are sent across the network anyway, Impala could just use all the nodes in the cluster. The upside would be better parallelism for the join itself as well as for all the operators sitting on top of it. Is there a downside I'm forgetting? If not, is there a jira tracking this already? Haven't found one. Thanks!