Hyunsik Choi created TAJO-987:
---------------------------------

             Summary: Hash shuffle should be balanced according to intermediate 
volumes
                 Key: TAJO-987
                 URL: https://issues.apache.org/jira/browse/TAJO-987
             Project: Tajo
          Issue Type: Bug
          Components: data shuffle
            Reporter: Hyunsik Choi
            Assignee: Hyunsik Choi
             Fix For: 0.9.0


It is not hard to see skewed data set in practice. Currently, hash shuffled 
intermediate are performed by distributing partition keys without considering 
their partition volumes. As a result, with skewed intermediate data, a few of 
nodes are likely to take much longer time than most of all nodes. It can cause 
performance degradation. We need some solution to mitigate this problem.

This patch assigns the intermediate data by balancing their volumes. The 
approach is a kind of greedy algorithm. In many cases, the shuffle num can be 
over tens of thousands. I also considered the computation complexity. Its 
complexity is O (n). It will show reasonable performance and balanced results.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to