[
https://issues.apache.org/jira/browse/HAMA-647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461600#comment-13461600
]
Yuesheng Hu commented on HAMA-647:
----------------------------------
For example, a cluster with 100 nodes, every node has 20 GB RAM, storage is
sufficient. Suppose there are 20 tasks per node (1024MB/task), the task
capacity of this cluster is 2000. I think tht cluster is powerful enough to
handle 1TB input. Also suppose the HDFS block size is 64MB, we use all task
slots to handle this 1TB input.
splitSize = Math.max(minSize, Math.min(goalSize, blockSize)) this is what the
computeSplitSize in FileInputFormat doing. Normally, the
minSize("bsp.min.split.size") is set to be 1. So, the splitSize is the minimun
between goalSize and blockSize. Let's make a caculation:
goalSize = totalSize/numSplits = 1TB/2000 = 524 MB(integer truncation, goalSize
actually is 524.288), it is larger than blockSize(64MB), so the splitSize will
be blockSize(64MB). This will split the input into at least 16384 splits that
is much larger than the tasks capacity(2000), it will cause the job failed
except the number of input files is equal to the numTasks.
The solution is :
1. long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits - 1) because
integer truncation, we shoud divided totalSize by numSplit - 1 so that it can
compensate the loss of accurary.
2. protected long computeSplitSize(long goalSize, long minSize, long
blockSize) {
if (goalSize > blockSize) {
return Math.max(minSize, Math.max(goalSize, blockSize));
} else {
return Math.max(minSize, Math.min(goalSize, blockSize));
}
This will always return the goalSize as splitSize if we ignore the minSize, but
it can guarantee that the number of large input's splits is not beyond the
numTasks, and the small input will also get a good parallelism degree (some
program need more than 1 task to exchange message, like k-means).
> Make the input spliter robustly
> --------------------------------
>
> Key: HAMA-647
> URL: https://issues.apache.org/jira/browse/HAMA-647
> Project: Hama
> Issue Type: Improvement
> Components: bsp core
> Affects Versions: 0.5.0, 0.6.0
> Reporter: Yuesheng Hu
> Assignee: Yuesheng Hu
> Priority: Critical
> Fix For: 0.6.0
>
>
> Currently, the spliter in FileInputFormat is based on the Mapreduce's
> spliter. But, Hama is different from Mapreduce, Hama's task can not be
> pended until the slot becomes free. So, the current spliter is not suitable
> for Hama. When input file is small, it may be ok, but when input is very
> large, the number of splits will be very large too, even our cluster is
> powerful enough to handle the input. More details, please see the comments.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira