Hyunsik Choi created TAJO-987:
---------------------------------
Summary: Hash shuffle should be balanced according to intermediate
volumes
Key: TAJO-987
URL: https://issues.apache.org/jira/browse/TAJO-987
Project: Tajo
Issue Type: Bug
Components: data shuffle
Reporter: Hyunsik Choi
Assignee: Hyunsik Choi
Fix For: 0.9.0
It is not hard to see skewed data set in practice. Currently, hash shuffled
intermediate are performed by distributing partition keys without considering
their partition volumes. As a result, with skewed intermediate data, a few of
nodes are likely to take much longer time than most of all nodes. It can cause
performance degradation. We need some solution to mitigate this problem.
This patch assigns the intermediate data by balancing their volumes. The
approach is a kind of greedy algorithm. In many cases, the shuffle num can be
over tens of thousands. I also considered the computation complexity. Its
complexity is O (n). It will show reasonable performance and balanced results.
--
This message was sent by Atlassian JIRA
(v6.2#6252)