Partitioning should be done in parallel
---------------------------------------

                 Key: HAMA-560
                 URL: https://issues.apache.org/jira/browse/HAMA-560
             Project: Hama
          Issue Type: Improvement
          Components: bsp
    Affects Versions: 0.4.0
            Reporter: praveen sripati


Currently, partitioning happens on the node where the job is submitted, in 
BSPJobClient#submitJobInternal(). The partitioning runs sequentially, which 
will become a bottleneck as the input data size grows. With partitioning done 
in parallel, the completion time for the job should also improve.

Here are some of the options to evaluate:

- Use multiple threads to do the partitioning in BSPJobClient#partition(). This 
is an easy fix, but the partitioning is still restricted to a single node. 
There might be a problem with simultaneous writes to the same file.

- Use MapReduce (MR) to partition the data. We need to check whether we can 
kick off an MR job from BSPJobClient#partition() to partition the input data. 
The number of reducers should be set to the number of BSP tasks.
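
To make the first option concrete, here is a minimal sketch of multi-threaded 
hash partitioning. It is not Hama code: the class ParallelPartitionSketch, the 
methods partitionOf() and partition(), and the use of in-memory lists of 
strings instead of input files are all assumptions made for illustration. Each 
worker thread fills its own private buckets and the results are merged by a 
single thread afterwards, which is one possible way to sidestep simultaneous 
writes to the same partition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; none of these names exist in Hama.
class ParallelPartitionSketch {

    // Hash partitioning: map a record to one of numTasks partitions.
    // floorMod keeps the result non-negative for negative hash codes.
    static int partitionOf(String record, int numTasks) {
        return Math.floorMod(record.hashCode(), numTasks);
    }

    // Split the records into numThreads slices, partition each slice in its
    // own thread, then merge the per-thread buckets in the calling thread.
    static List<List<String>> partition(List<String> records, int numTasks,
                                        int numThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<Future<List<List<String>>>> futures = new ArrayList<>();
            int chunk = (records.size() + numThreads - 1) / numThreads;
            for (int t = 0; t < numThreads; t++) {
                int from = Math.min(t * chunk, records.size());
                int to = Math.min(from + chunk, records.size());
                List<String> slice = records.subList(from, to);
                futures.add(pool.submit(() -> {
                    // Buckets private to this worker: no shared writes.
                    List<List<String>> local = new ArrayList<>();
                    for (int p = 0; p < numTasks; p++) {
                        local.add(new ArrayList<>());
                    }
                    for (String r : slice) {
                        local.get(partitionOf(r, numTasks)).add(r);
                    }
                    return local;
                }));
            }
            // Single-threaded merge, one output bucket per BSP task.
            List<List<String>> merged = new ArrayList<>();
            for (int p = 0; p < numTasks; p++) {
                merged.add(new ArrayList<>());
            }
            for (Future<List<List<String>>> f : futures) {
                List<List<String>> local = f.get();
                for (int p = 0; p < numTasks; p++) {
                    merged.get(p).addAll(local.get(p));
                }
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as ParallelPartitionSketch.partition(records, 3, 4) would produce 
three buckets using four worker threads. With real input files the merge step 
would probably be replaced by each thread writing its own part file per 
partition, but the work would still be limited to a single node, which is the 
restriction the MR option is meant to remove.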


        
