[jira] [Commented] (HAMA-700) BSPPartitioner should be configurable to be optional and allow input format conversion

Apurv Verma (JIRA) Mon, 14 Jan 2013 19:52:21 -0800

    [ 
https://issues.apache.org/jira/browse/HAMA-700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13553465#comment-13553465
 ]


Apurv Verma commented on HAMA-700:
----------------------------------

Hey,
 Although this has been committed, but here is my review. I also have a few 
questions.

1)Why did you choose to introduce getPartitionId() method in 
PartitioningRunner.RecordConverter. Doesn't it do exactly what Partitioner's 
getPartition does? Can we avoid this method here. 
  In the current implementation Partitioner's partition method just gets the 
input key and value and numTasks, based on which it has to decide to the output 
partition. If you argue that partitioner might need some additional information 
to decide the partition, as has been used in your patch at some places, I would 
say we should implement the Partitioner by making it JobConfigurable. See 
Partitioner implementation of hadoop to get a feel of the implementation.

2) I abide by these simple rules for the partitioning runner.

   a) PartiotionerRunner job executes before the actual bsp job gets executed.
   b) Its role is to partition the input and now additionally to convert the 
input records to a record format   that will be fed into the main bsp job.
   
   So abiding by these rules, earlier this used to happen for graph jobs.

   input records were partitioned by PartitioniningRunner using HashPartitioner.
   these records were fed into GraphJob which parsed the records into Vertices.
     

   wherease now, with your implementation this happens.
   input records get parsed into vertices
   vertices are fed as input to GraphJob.
   
   We should maintain consistency. Let's have record partitioning as a separate 
feature and graph partitioning as separate. In a way I can see you have tried 
to use record partitioning for graph partitioning also. Do I understand it 
properly?
   
   So what I suggest is as follows. Although you have already implemented it 
and its just a suggestion. Sorry for being a pain here.
   input records ---(x)---> intermediate records --(y)--> vertex

   (x) Will be done by partitioning runner. This will also do record conversion.
   (y) Will be done by VertexInputReader. (IMO we shouldn't make 
VertexInputReader implement RecordConverter)




(3) abstract V createVertexIDObject(); 
    abstract E createEdgeCostObject();
    abstract M createVertexValue();
    Can't we use reflection to avoid these methods. Anyways user is specifying 
the VertexID.class, Edge.class and also their types. So we should be able to 
create them via reflection instead of delegating this responsibility to each 
implementing class everytime.




(4) if(outKeyClass == null && outputValueClass == null){
       outputKeyClass = //initialize here
       //...
    }
    This is a nice trick but IMO it would be good to make RecordConverter 
generic and also specify the .class for these in the bsp job configuration 
parameters. See line 148 PartitioningRunner.


(5) Line 415, BSPJobClient
    
        if (numTasks == 0) {
          numTasks = numSplits;
        }

   This numTasks is actually the numTasks for the PartitioningRunner job, do 
you intend to do it for the main job or the partitioning job that will execute 
prior to the main bsp job? 
                
> BSPPartitioner should be configurable to be optional and allow input format 
> conversion
> --------------------------------------------------------------------------------------
>
>                 Key: HAMA-700
>                 URL: https://issues.apache.org/jira/browse/HAMA-700
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp core
>            Reporter: Suraj Menon
>            Assignee: Suraj Menon
>             Fix For: 0.6.1
>
>         Attachments: HAMA-700.patch_Jan7, HAMA-700.patch.v2, HAMA-700-v1.patch
>
>
> There should be a provisioning for skipping the PartitionRunner if needed. 
> Also we can have a RecordConverter interface so that the PartitionRunner can 
> read the input in any format and create new splits. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HAMA-700) BSPPartitioner should be configurable to be optional and allow input format conversion

Reply via email to