Hello,

In HIVE-21189 [1], the default value for hive.merge.nway.joins is set to
false. There is no record of why it was set to false, and I would like to
understand the background for the decision. Specifically I wonder if the
following situation is relevant to the decision.

Example)
MapJoinOp_1 joins: table G, table A, table B, table C
MapJoinOp_2 joins: table G, table A, table B              , table D

Here, table G is a big table to be read via shuffling.
MayJoinOp_1 needs table C, while MapJoinOp_2 needs table D.
SharedWorkOptimizer assigns the same cache key to MapJoinOp_1 and
MapJoinOp_2 (because of table G and table A), so that both operators can
share in-memory tables.

Assume that MapJoinOp_1 is executed first and fills the cache first. Then,
MapJoinOp_2 does not load the cache which is already filled. As a result,
it ends up with something like NullPointerException.

After setting hive.merge.nway.joins to true, I encountered a problem (which
is not easy to reproduce), and I wonder if the above scenario is feasible
in the current implementation.

Many thanks,

--- Sungwoo






[1] https://issues.apache.org/jira/browse/HIVE-21189

Reply via email to