Is anybody interested in, or has anyone addressed, such a problem? Or does it seem to be an esoteric use case?
On Wed, Sep 9, 2009 at 7:06 PM, Aviad sela <sela.s...@gmail.com> wrote:
> I am using Hadoop 0.19.1.
>
> I attempt to split an input into multiple directories.
> I don't know in advance how many directories exist.
> I don't know in advance what the directory depth is.
> I expect that under each such directory a file exists with all available
> records having the same key permutation found in the job.
>
> Currently each reducer produces a single output, i.e. PART-0001.
> I would like to create as many directories as necessary, following the pattern:
>
> key1/key2/.../keyN/PART-0001
>
> where "key?" may take different values for each input record.
> Different records may result in different requested paths:
>
> key1a/key2b/PART-0001
> key1c/key2d/key3e/PART-0001
>
> To keep it simple, during each job we may expect the same depth from each record.
>
> I assume that the input records imply that each reducer will produce several
> hundred such directories.
> (Indeed, this strongly depends on the input record semantics.)
>
> The MAP part reads a record and, following some logic, assigns a key like:
> KEY_A, KEY_B
> The MAP value is the original input line.
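To make that concrete, a minimal sketch of such a mapper follows (the class name PathKeyMapper and the assumption that the key parts come from the first two comma-separated fields are mine, for illustration only):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class PathKeyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String[] fields = line.toString().split(",");
        if (fields.length < 2) {
            return; // skip records that cannot yield both key parts
        }
        // Illustrative key derivation: the first two fields become the
        // key parts, e.g. "KEY_A,KEY_B", which the output format later
        // turns into the directory path KEY_A/KEY_B/.
        String key = fields[0].trim() + "," + fields[1].trim();
        // The value is the original input line, as described above.
        output.collect(new Text(key), line);
    }
}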
> For the reducer part I assign the IdentityReducer and have set:
>
> jobConf.setReducerClass(IdentityReducer.class);
> jobConf.setOutputFormat(MyTextOutputFormat.class);
>
> MyTextOutputFormat extends MultipleTextOutputFormat and implements:
>
> protected String generateFileNameForKeyValue(K key, V value, String name)
> {
>     String keyParts[] = key.toString().split(",");
>     Path finalPath = null;
>     // Build the directory structure comprised of the key parts
>     for (int i = 0; i < keyParts.length; i++)
>     {
>         String part = keyParts[i].trim();
>         if (false == "".equals(part))
>         {
>             if (null == finalPath)
>                 finalPath = new Path(part);
>             else
>             {
>                 finalPath = new Path(finalPath, part);
>             }
>         }
>     } // end of for
>
>     String fileName = generateLeafFileName(name);
>     finalPath = new Path(finalPath, fileName);
>
>     return finalPath.toString();
> } // generateFileNameForKeyValue
>
> During execution I have seen that the reduce attempts do create the following
> path under the output path:
>
> "/user/hadoop/test_rep01/_temporary/_attempt_200909080349_0013_r_000000_0/KEY_A/KEY_B/part-00000"
>
> However, the file was empty.
>
> The job fails at the end with the following exceptions found in the task log:
>
> 2009-09-09 11:19:49,653 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 9.148.30.71:50010
> 2009-09-09 11:19:49,654 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6138647338595590910_39383
> 2009-09-09 11:19:55,659 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2722)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
> 2009-09-09 11:19:55,659 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6138647338595590910_39383 bad datanode[1] nodes == null
> 2009-09-09 11:19:55,660 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/user/hadoop/test_rep01/_temporary/_attempt_200909080349_0013_r_000002_0/KEY_A/KEY_B/part-00002" - Aborting...
> 2009-09-09 11:19:55,686 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
> java.io.IOException: Bad connect ack with firstBadLink 9.148.30.71:50010
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2780)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2703)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
> 2009-09-09 11:19:55,688 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
>
> The command line also writes:
>
> 09/09/09 11:24:06 INFO mapred.JobClient: Task Id : attempt_200909080349_0013_r_000003_2, Status : FAILED
> java.io.IOException: Bad connect ack with firstBadLink 9.148.30.80:50010
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2780)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2703)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
>
> Any ideas on how to support such a scenario?
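For completeness, the driver wiring looks roughly like this (a sketch: the class name SplitByKeyJob, the argument handling, and PathKeyMapper from the sketch above are placeholders; the setReducerClass and setOutputFormat calls are the ones quoted above):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SplitByKeyJob {
    public static void main(String[] args) throws Exception {
        JobConf jobConf = new JobConf(SplitByKeyJob.class);
        jobConf.setJobName("split-by-key");

        // Mapper emits "KEY_A,KEY_B"-style keys; the value is the input line.
        jobConf.setMapperClass(PathKeyMapper.class);
        // Records pass through unchanged; only the output placement matters.
        jobConf.setReducerClass(IdentityReducer.class);
        // MyTextOutputFormat turns each key into a directory path.
        jobConf.setOutputFormat(MyTextOutputFormat.class);

        jobConf.setOutputKeyClass(Text.class);
        jobConf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
        FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));

        JobClient.runJob(jobConf);
    }
}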