[
https://issues.apache.org/jira/browse/TAJO-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyunsik Choi updated TAJO-987:
------------------------------
Description:
It is not hard to see skewed data set in practice. Currently, hash shuffled
intermediate are performed by distributing partition keys without considering
their partition volumes. As a result, with skewed intermediate data, a few of
nodes are likely to take much longer time than most of all nodes. It can cause
performance degradation. We need some solution to mitigate this problem.
This patch assigns the intermediate data by balancing their volumes. The
approach is a kind of greedy algorithm. In many cases, the shuffle num can be
over tens of thousands. I also considered the computation complexity. Its
complexity is O \(n\). It will show reasonable performance and balanced results.
was:
It is not hard to see skewed data set in practice. Currently, hash shuffled
intermediate are performed by distributing partition keys without considering
their partition volumes. As a result, with skewed intermediate data, a few of
nodes are likely to take much longer time than most of all nodes. It can cause
performance degradation. We need some solution to mitigate this problem.
This patch assigns the intermediate data by balancing their volumes. The
approach is a kind of greedy algorithm. In many cases, the shuffle num can be
over tens of thousands. I also considered the computation complexity. Its
complexity is O (n). It will show reasonable performance and balanced results.
> Hash shuffle should be balanced according to intermediate volumes
> -----------------------------------------------------------------
>
> Key: TAJO-987
> URL: https://issues.apache.org/jira/browse/TAJO-987
> Project: Tajo
> Issue Type: Bug
> Components: data shuffle
> Reporter: Hyunsik Choi
> Assignee: Hyunsik Choi
> Fix For: 0.9.0
>
>
> It is not hard to see skewed data set in practice. Currently, hash shuffled
> intermediate are performed by distributing partition keys without considering
> their partition volumes. As a result, with skewed intermediate data, a few of
> nodes are likely to take much longer time than most of all nodes. It can
> cause performance degradation. We need some solution to mitigate this problem.
> This patch assigns the intermediate data by balancing their volumes. The
> approach is a kind of greedy algorithm. In many cases, the shuffle num can be
> over tens of thousands. I also considered the computation complexity. Its
> complexity is O \(n\). It will show reasonable performance and balanced
> results.
--
This message was sent by Atlassian JIRA
(v6.2#6252)