[ 
https://issues.apache.org/jira/browse/HAMA-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13528827#comment-13528827
 ] 

Thomas Jungblut commented on HAMA-531:
--------------------------------------

Okay Edward, you haven't understood how it works yet.

Maybe it will be clearer when I show you some code.
bq. Mindist search example

If you define this in the job:
{noformat}
    job.setInputKeyClass(Text.class);
    job.setInputValueClass(TextArrayWritable.class);
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setVertexInputReaderClass(MindistSearchCountReader.class);
{noformat}

And you define your reader like this:

{noformat}
public static class MindistSearchCountReader extends
    VertexInputReader<Text, TextArrayWritable, Text, NullWritable, Text> {

  @Override
  public boolean parseVertex(Text key, TextArrayWritable value,
      Vertex<Text, NullWritable, Text> vertex) {
    vertex.setVertexID(key);
    // add one edge per entry in the adjacency array; no edge value is needed
    for (Text edgeName : value.get()) {
      vertex.addEdge(new Edge<Text, NullWritable>(new Text(edgeName), null));
    }
    return true;
  }
}
{noformat}

Then you can support your binary sequencefile format as well as all the other 
damn formats that exist in the whole world.
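For example, the same API could cover a plain-text adjacency list just by swapping the reader. This is only a sketch under assumptions: the class name `MindistSearchTextReader`, the tab-separated line format ("vertexID\tneighbor1\tneighbor2..."), and the use of `TextInputFormat` key/value types are hypothetical illustrations, not anything from the patch.

{noformat}
// Hypothetical reader for a tab-separated text file, one vertex per line:
//   vertexID \t neighbor1 \t neighbor2 ...
public static class MindistSearchTextReader extends
    VertexInputReader<LongWritable, Text, Text, NullWritable, Text> {

  @Override
  public boolean parseVertex(LongWritable key, Text value,
      Vertex<Text, NullWritable, Text> vertex) {
    String[] tokens = value.toString().split("\t");
    // skip malformed lines instead of failing the job
    if (tokens.length == 0) {
      return false;
    }
    vertex.setVertexID(new Text(tokens[0]));
    for (int i = 1; i < tokens.length; i++) {
      vertex.addEdge(new Edge<Text, NullWritable>(new Text(tokens[i]), null));
    }
    return true;
  }
}
{noformat}

The job would then set TextInputFormat with LongWritable/Text input key/value classes; nothing else in the vertex API changes.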

If you want to have a binary sequencefile format, then do this. But I will quit 
committing to Hama then, because I'm not going to support ONLY a binary format. 
This is not what I built a framework for.

What if you want to change this binary format? Do you want to recreate every 
file on the whole planet? You would have to take care of versioning then, and 
that is why we need a proxy between the input format and our vertex API.
                
> Data re-partitioning in BSPJobClient
> ------------------------------------
>
>                 Key: HAMA-531
>                 URL: https://issues.apache.org/jira/browse/HAMA-531
>             Project: Hama
>          Issue Type: Improvement
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>            Priority: Critical
>             Fix For: 0.6.1
>
>         Attachments: HAMA-531_1.patch, HAMA-531_2.patch, 
> HAMA-531_final.patch, patch.txt, patch_v02.txt, patch_v03.txt, patch_v04.txt
>
>
> Re-partitioning the data is a very expensive operation. Currently, we 
> process read/write operations sequentially using the HDFS API in 
> BSPJobClient on the client side. This can cause "too many open files" 
> errors, incurs HDFS overhead, and shows slow performance.
> We have to find another way to re-partition the data.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
