[ 
https://issues.apache.org/jira/browse/HAMA-647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461600#comment-13461600
 ] 

Yuesheng Hu commented on HAMA-647:
----------------------------------

For example, consider a cluster of 100 nodes, each with 20 GB of RAM and 
sufficient storage. Suppose each node runs 20 tasks (1024 MB per task), so the 
task capacity of the cluster is 2000. I think that cluster is powerful enough 
to handle 1 TB of input. Also suppose the HDFS block size is 64 MB and we use 
all task slots to handle this 1 TB input. 

splitSize = Math.max(minSize, Math.min(goalSize, blockSize)) is what 
computeSplitSize in FileInputFormat does. Normally minSize 
("bsp.min.split.size") is set to 1, so splitSize is the minimum of goalSize 
and blockSize. Let's do the calculation: 
goalSize = totalSize / numSplits = 1 TB / 2000 = 524 MB (after integer 
truncation; the exact value is 524.288 MB). That is larger than blockSize 
(64 MB), so splitSize will be blockSize (64 MB). This splits the input into at 
least 16384 splits, far more than the task capacity (2000), so the job will 
fail unless the number of input files equals numTasks.
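
The arithmetic above can be checked with a small standalone sketch. The class 
name, the helper minSplits and the main method are hypothetical and exist only 
for illustration; only the max/min formula is taken from the comment, and this 
is not the actual Hama FileInputFormat code.

```java
// Hypothetical demo class; only the formula itself comes from the comment.
final class CurrentSplitDemo {
    static final long MB = 1024L * 1024L;

    // The formula quoted above for computeSplitSize.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Lower bound on the number of splits for totalSize bytes
    // (ceiling division; assumes the input is split purely by size).
    static long minSplits(long totalSize, long splitSize) {
        return (totalSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long totalSize = 1024L * 1024L * MB; // 1 TB
        long numSplits = 100 * 20;           // 100 nodes x 20 tasks per node
        long blockSize = 64 * MB;            // HDFS block size
        long minSize = 1;                    // bsp.min.split.size

        long goalSize = totalSize / numSplits; // integer division truncates
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);

        System.out.println("goalSize  (MB): " + goalSize / MB);
        System.out.println("splitSize (MB): " + splitSize / MB);
        System.out.println("min splits    : " + minSplits(totalSize, splitSize));
    }
}
```

With these inputs the split size is capped at the block size, so the split 
count depends only on the input size and not on the 2000 available task slots.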


The solution is:
1. long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits - 1)
   Because of integer truncation, we should divide totalSize by numSplits - 1 
   to compensate for the loss of accuracy.
2.  protected long computeSplitSize(long goalSize, long minSize,
        long blockSize) {
      if (goalSize > blockSize) {
        return Math.max(minSize, Math.max(goalSize, blockSize));
      } else {
        return Math.max(minSize, Math.min(goalSize, blockSize));
      }
    }
If we ignore minSize, this always returns goalSize as the splitSize, but it 
guarantees that the number of splits for a large input does not exceed 
numTasks, and a small input still gets a good degree of parallelism (some 
programs, like k-means, need more than one task in order to exchange messages).
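
As a rough check of the two proposed changes on the same 1 TB example, here is 
a minimal sketch. The class name and the helper splitsForSingleFile are 
hypothetical. The guard for numSplits <= 1 is an assumption added here to avoid 
a division by zero when numSplits is 1 (the expression in item 1 only guards 
0). The split count assumes a single input file split purely by size; with 
many input files the real count could differ.

```java
// Hypothetical demo class illustrating the proposal; not the committed patch.
final class ProposedSplitDemo {
    static final long MB = 1024L * 1024L;

    // Item 1: divide by numSplits - 1 to compensate for truncation.
    // ASSUMPTION: numSplits <= 1 is mapped to a divisor of 1 in this sketch.
    static long goalSize(long totalSize, long numSplits) {
        return totalSize / (numSplits <= 1 ? 1 : numSplits - 1);
    }

    // Item 2: both branches evaluate to goalSize, so the result is
    // effectively Math.max(minSize, goalSize).
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        if (goalSize > blockSize) {
            return Math.max(minSize, Math.max(goalSize, blockSize));
        } else {
            return Math.max(minSize, Math.min(goalSize, blockSize));
        }
    }

    // ASSUMPTION: one input file, split by size only (ceiling division).
    static long splitsForSingleFile(long totalSize, long splitSize) {
        return (totalSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long totalSize = 1024L * 1024L * MB; // 1 TB
        long numSplits = 2000;               // task capacity of the cluster
        long blockSize = 64 * MB;
        long minSize = 1;                    // bsp.min.split.size

        long goal = goalSize(totalSize, numSplits);
        long splitSize = computeSplitSize(goal, minSize, blockSize);

        System.out.println("splitSize (bytes): " + splitSize);
        System.out.println("splits           : "
                + splitsForSingleFile(totalSize, splitSize));
    }
}
```

Under these assumptions the 1 TB input yields 1999 full splits plus one small 
remainder split, so the count stays within the 2000 task slots.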
                
> Make the  input spliter robustly
> --------------------------------
>
>                 Key: HAMA-647
>                 URL: https://issues.apache.org/jira/browse/HAMA-647
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp core
>    Affects Versions: 0.5.0, 0.6.0
>            Reporter: Yuesheng Hu
>            Assignee: Yuesheng Hu
>            Priority: Critical
>             Fix For: 0.6.0
>
>
> Currently, the splitter in FileInputFormat is based on MapReduce's 
> splitter. But Hama differs from MapReduce: a Hama task cannot be left 
> pending until a slot becomes free, so the current splitter is not suitable 
> for Hama. For a small input file this may be fine, but when the input is very 
> large, the number of splits will be very large too, even if the cluster is 
> powerful enough to handle the input. For more details, please see the comments.

