Jeszy, the way I read your question is: How much inter-node parallelism is
As usual with perf question the answer is "it depends". Involving all nodes
in the cluster for a PHJ may not work well. Intuitively, each node should
have a minimum amount of work for the cost of shipping fragments there to
be worth it. So ultimately, you need to estimate how much work each node is
going to do - which is hard (planner estimates). Involving too many nodes
can make the query slower (query startup) and less efficient (work vs.
query startup). Further, involving more nodes can exacerbate the thrift
connection issues we're aware of.
The benefit of the current policy is that it is simple and does not rely
much on planner estimates. Simple is typically more robust than fancy
because similar queries tend to have similar plans (and degrees of
Feel free to file a JIRA if you think this should be addressed. However, I
don't think the right policy is "always run on all nodes".
On Mon, Feb 12, 2018 at 4:18 AM, Jeszy <jes...@gmail.com> wrote:
> Thanks for the response, Quanlong. The behaviour you describe is broadcast
> join (versus partitioned / shuffle) - sorry for confusing usage of terms!
> Take a look at the differences in the cost model for the two (in lieu of
> better description):
> A partial summary for a shuffle join would be:
> Operator #Hosts
> 05:EXCHANGE 1
> 02:HASH JOIN 2
> |--04:EXCHANGE 2
> | 01:SCAN HDFS 2
> 03:EXCHANGE 2
> 00:SCAN HDFS 2
> Notice Exchanges on both sides.
> On 12 February 2018 at 12:51, Quanlong Huang <huang_quanl...@126.com>
> > IMU, the left side is always located with the hash join node. If the
> > are correct, the left side will always be a larger table/input. There're
> > two terminologies in the hash join algorithm: build and probe. The
> > table that can be built into an in-memory hash table is called the
> > input. It's represented at the right side. After the in-memory hash table
> > is built, the larger table will be scanned and rows will be probed in the
> > hash table to find matched results. The larger table is called the
> > input and represented at the left side.So not all rows are sent across
> > network to perform a hash join. Usually the larger table is scanned
> > locally. Network traffic comes from the "build" input. It's smaller and
> > sometimes can even be represented as a BloomFilter (one kind of
> > RuntimeFilter in Impala).
> > However, there's still one case that all rows are sent across the network
> > anyway. That is when all tables are not located in the Impala cluster
> > Impala is deployed in a portion of the Hadoop cluster). Scanning the
> > both consumes network traffic. However, when performing hash join, the
> > results of the right side will be sent to the left side, since they have
> > smaller size and consumes less network traffic than sending the left
> > I find this paper in "Impala Reading List" has much more details and
> > deserves to be read more times:
> > Hash joins and hash teams in Microsoft SQL Server (Graefe, Bunker,
> > HTH
> > At 2018-02-12 18:13:09, "Jeszy" <jes...@gmail.com> wrote:
> > >IIUC, every row scanned in a partitioned hash join (both sides) is sent
> > >across the network (an exchange on HASH(key)). The targets of this
> > exchange
> > >are nodes that have data locality with the left side of the join. Why
> > >Impala do it that way?
> > >
> > >Since all rows are sent across the network anyway, Impala could just use
> > >all the nodes in the cluster. The upside would be better parallelism for
> > >the join itself as well as for all the operators sitting on top of it.
> > >there a downside I'm forgetting?
> > >If not, is there a jira tracking this already? Haven't found one.
> > >
> > >Thanks!