IIUC, in a partitioned hash join every scanned row, from both sides, is
sent across the network through an exchange on HASH(key). The targets of
this exchange are the nodes that have data locality with the left side of
the join. Why does Impala do it that way?
Since all rows are sent across the network anyway, Impala could just use
all the nodes in the cluster. The upside would be better parallelism for
the join itself as well as for all the operators sitting on top of it. Is
there a downside I'm forgetting?
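To make the argument concrete, here is a toy sketch in Python. It is not
Impala code: exchange_targets, the node names, and the modulo routing are
all made up for illustration, and the real exchange may pick targets
differently. It only shows that once rows are routed by hash of the join
key, the set of target nodes is a free choice, and a larger set means
fewer rows per node.

```python
def exchange_targets(keys, target_nodes):
    """Count how many rows each target node receives when rows are
    routed by hash of the join key (hypothetical stand-in for an
    exchange on HASH(key))."""
    counts = {node: 0 for node in target_nodes}
    for key in keys:
        # Destination depends only on the key, not on where the row was scanned.
        node = target_nodes[hash(key) % len(target_nodes)]
        counts[node] += 1
    return counts

keys = range(12000)                             # join keys of all scanned rows
left_local = ["node1", "node2", "node3"]        # nodes local to the left side
all_nodes = [f"node{i}" for i in range(1, 11)]  # every node in a 10-node cluster

few = exchange_targets(keys, left_local)
many = exchange_targets(keys, all_nodes)
print(max(few.values()), max(many.values()))    # prints: 4000 1200
```

With the same input, each join instance handles 4000 rows when only the
three left-side nodes are targets, versus 1200 rows when all ten nodes are.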
If not, is there a JIRA tracking this already? I haven't found one.