Is anybody interested in, or has anyone addressed, such a problem? Or does it seem to be an esoteric use case?
On Wed, Sep 9, 2009 at 7:06 PM, Aviad sela <sela.s...@gmail.com> wrote:
> I am using Hadoop 0.19.1.
>
> I attempt to split an input into multiple directories.
> I don't know in advance how many directories exist.
> I don't know in advance what the directory depth is.
> I expect that under each such directory a file exists with all available
> records having the same key permutation found in the job.
>
> Currently each reducer produces a single output, i.e. PART-0001.
> I would like to create as many directories as necessary, following the pattern:
>
> key1/key2/.../keyN/PART-0001
>
> where "key?" may take different values for each input record.
> Different records may result in different requested paths:
>
> key1a/key2b/PART-0001
> key1c/key2d/key3e/PART-0001
>
> To keep it simple, during each job we may expect the same depth from each record.
>
> I assume that the input records imply that each reducer will produce several
> hundred such directories.
> (Indeed, this strongly depends on the input record semantics.)
>
> The MAP part reads a record and, following some logic, assigns a key like:
> KEY_A, KEY_B
> The MAP value is the original input line.
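To make that concrete, a minimal sketch of such a mapper follows (the class name PathKeyMapper and the assumption that the key parts come from the first two comma-separated fields are mine, for illustration only):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class PathKeyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String[] fields = line.toString().split(",");
        if (fields.length < 2) {
            return; // skip records that cannot yield both key parts
        }
        // Illustrative key derivation: the first two fields become the
        // key parts, e.g. "KEY_A,KEY_B", which the output format later
        // turns into the directory path KEY_A/KEY_B/.
        String key = fields[0].trim() + "," + fields[1].trim();
        // The value is the original input line, as described above.
        output.collect(new Text(key), line);
    }
}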
> For the reducer part I assign the IdentityReducer and have set:
>
> jobConf.setReducerClass(IdentityReducer.class);
> jobConf.setOutputFormat(MyTextOutputFormat.class);
>
> MyTextOutputFormat extends MultipleTextOutputFormat and implements:
>
> protected String generateFileNameForKeyValue(K key, V value, String name)
> {
>     String keyParts[] = key.toString().split(",");
>     Path finalPath = null;
>     // Build the directory structure comprised of the key parts
>     for (int i = 0; i < keyParts.length; i++)
>     {
>         String part = keyParts[i].trim();
>         if (false == "".equals(part))
>         {
>             if (null == finalPath)
>                 finalPath = new Path(part);
>             else
>             {
>                 finalPath = new Path(finalPath, part);
>             }
>         }
>     } // end of for
>
>     String fileName = generateLeafFileName(name);
>     finalPath = new Path(finalPath, fileName);
>
>     return finalPath.toString();
> } // generateFileNameForKeyValue
>
> During execution I have seen that the reduce attempts do create the following
> path under the output path:
>
> "/user/hadoop/test_rep01/_temporary/_attempt_200909080349_0013_r_000000_0/KEY_A/KEY_B/part-00000"
>
> However, the file was empty.
>
> The job fails at the end with the following exceptions found in the task log:
>
> 2009-09-09 11:19:49,653 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 9.148.30.71:50010
> 2009-09-09 11:19:49,654 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6138647338595590910_39383
> 2009-09-09 11:19:55,659 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2722)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
> 2009-09-09 11:19:55,659 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6138647338595590910_39383 bad datanode[1] nodes == null
> 2009-09-09 11:19:55,660 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/user/hadoop/test_rep01/_temporary/_attempt_200909080349_0013_r_000002_0/KEY_A/KEY_B/part-00002" - Aborting...
> 2009-09-09 11:19:55,686 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
> java.io.IOException: Bad connect ack with firstBadLink 9.148.30.71:50010
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2780)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2703)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
> 2009-09-09 11:19:55,688 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
>
> The command line also writes:
>
> 09/09/09 11:24:06 INFO mapred.JobClient: Task Id : attempt_200909080349_0013_r_000003_2, Status : FAILED
> java.io.IOException: Bad connect ack with firstBadLink 9.148.30.80:50010
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2780)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2703)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
>
> Any ideas on how to support such a scenario?
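For completeness, the driver wiring looks roughly like this (a sketch: the class name SplitByKeyJob, the argument handling, and PathKeyMapper from the sketch above are placeholders; the setReducerClass and setOutputFormat calls are the ones quoted above):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SplitByKeyJob {
    public static void main(String[] args) throws Exception {
        JobConf jobConf = new JobConf(SplitByKeyJob.class);
        jobConf.setJobName("split-by-key");

        // Mapper emits "KEY_A,KEY_B"-style keys; the value is the input line.
        jobConf.setMapperClass(PathKeyMapper.class);
        // Records pass through unchanged; only the output placement matters.
        jobConf.setReducerClass(IdentityReducer.class);
        // MyTextOutputFormat turns each key into a directory path.
        jobConf.setOutputFormat(MyTextOutputFormat.class);

        jobConf.setOutputKeyClass(Text.class);
        jobConf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
        FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));

        JobClient.runJob(jobConf);
    }
}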