I think I hit similar errors when writing via hundreds of open OutputStreams to HDFS - I presume MultipleOutputFormat has to do the same thing in each reducer if you give it many output files to create, as in your use case. I think the problem stems from having many concurrent sockets open.

You can mitigate it by adding dfs.datanode.max.xcievers to hadoop-site.xml and setting a higher value (if you check the DataNode logs you might find errors confirming it, e.g. java.io.IOException: xceiverCount 257 exceeds the limit of concurrent xcievers 256). However, I think this setting carries a memory overhead, so you can't keep increasing the value indefinitely - is there another way you can approach this without needing so many output files per reducer?
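In case it's useful, the entry I mean looks something like the following in hadoop-site.xml on each DataNode (1024 is only an illustrative value - since I believe each xceiver thread costs some memory, pick a value your DataNodes can afford; as far as I know the DataNodes need a restart to pick it up):

  <property>
    <name>dfs.datanode.max.xcievers</name>
    <!-- illustrative value only; increase with care, each xceiver uses memory -->
    <value>1024</value>
  </property>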
2009/9/15 Aviad sela <sela.s...@gmail.com>

> Is anybody interested in, or has anybody addressed, such a problem?
> Or does it seem to be esoteric usage?
>
> On Wed, Sep 9, 2009 at 7:06 PM, Aviad sela <sela.s...@gmail.com> wrote:
>
> > I am using Hadoop 0.19.1.
> >
> > I attempt to split an input into multiple directories.
> > I don't know in advance how many directories exist.
> > I don't know in advance what the directory depth is.
> > I expect that under each such directory a file exists with all available
> > records having the same key permutation found in the job.
> >
> > Where currently each reducer produces a single output, i.e. PART-0001,
> > I would like to create as many directories as necessary, following this
> > pattern:
> >
> >     key1/key2/.../keyN/PART-0001
> >
> > where each "key?" may take different values for each input record.
> > Different records may result in different requested paths:
> >
> >     key1a/key2b/PART-0001
> >     key1c/key2d/key3e/PART-0001
> >
> > To keep it simple, during each job we may expect the same depth from each
> > record.
> >
> > I assume that the input records imply that each reducer will produce
> > several hundred such directories.
> > (Indeed, this strongly depends on the input record semantics.)
> >
> > The MAP part reads a record and, following some logic, assigns a key like:
> >
> >     KEY_A, KEY_B
> >
> > The MAP value is the original input line.
> >
> > For the reducer part I assign the IdentityReducer.
> > However, I have set:
> >
> >     jobConf.setReducerClass(IdentityReducer.class);
> >     jobConf.setOutputFormat(MyTextOutputFormat.class);
> >
> > where MyTextOutputFormat extends MultipleTextOutputFormat and implements:
> >
> >     protected String generateFileNameForKeyValue(K key, V value, String name)
> >     {
> >         String keyParts[] = key.toString().split(",");
> >         Path finalPath = null;
> >
> >         // Build the directory structure comprised of the key parts
> >         for (int i = 0; i < keyParts.length; i++)
> >         {
> >             String part = keyParts[i].trim();
> >             if (false == "".equals(part))
> >             {
> >                 if (null == finalPath)
> >                     finalPath = new Path(part);
> >                 else
> >                 {
> >                     finalPath = new Path(finalPath, part);
> >                 }
> >             }
> >         } // end of for
> >
> >         String fileName = generateLeafFileName(name);
> >         finalPath = new Path(finalPath, fileName);
> >
> >         return finalPath.toString();
> >     } // generateFileNameForKeyValue
> >
> > During execution I have seen that the reduce attempts do create the
> > following path under the output path:
> >
> > "/user/hadoop/test_rep01/_temporary/_attempt_200909080349_0013_r_000000_0/KEY_A/KEY_B/part-00000"
> >
> > However, the file was empty.
> >
> > The job fails at the end with the following exceptions found in the task
> > log:
> >
> > 2009-09-09 11:19:49,653 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 9.148.30.71:50010
> > 2009-09-09 11:19:49,654 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6138647338595590910_39383
> > 2009-09-09 11:19:55,659 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2722)
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
> >
> > 2009-09-09 11:19:55,659 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6138647338595590910_39383 bad datanode[1] nodes == null
> > 2009-09-09 11:19:55,660 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/user/hadoop/test_rep01/_temporary/_attempt_200909080349_0013_r_000002_0/KEY_A/KEY_B/part-00002" - Aborting...
> > 2009-09-09 11:19:55,686 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
> > java.io.IOException: Bad connect ack with firstBadLink 9.148.30.71:50010
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2780)
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2703)
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
> > 2009-09-09 11:19:55,688 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
> >
> > The command line also writes:
> >
> > 09/09/09 11:24:06 INFO mapred.JobClient: Task Id : attempt_200909080349_0013_r_000003_2, Status : FAILED
> > java.io.IOException: Bad connect ack with firstBadLink 9.148.30.80:50010
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2780)
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2703)
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
> >
> > Any ideas how to support such a scenario?
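For what it's worth, here is roughly what the MyTextOutputFormat you describe looks like as a complete class against the old (0.19.x) mapred API. This is just a sketch, assuming Text keys and values (your snippet leaves K and V generic) and the comma-separated key layout from your "KEY_A, KEY_B" example:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

  public class MyTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {

      @Override
      protected String generateFileNameForKeyValue(Text key, Text value, String name) {
          // Turn a key such as "KEY_A,KEY_B" into KEY_A/KEY_B/<leaf name>,
          // skipping empty key parts.
          Path dir = null;
          for (String part : key.toString().split(",")) {
              part = part.trim();
              if (!part.isEmpty()) {
                  dir = (dir == null) ? new Path(part) : new Path(dir, part);
              }
          }
          // generateLeafFileName(name) just returns the task's own name
          // (e.g. part-00000) unless you override it as well.
          String leaf = generateLeafFileName(name);
          return (dir == null) ? leaf : new Path(dir, leaf).toString();
      }
  }

Wired up exactly as in your snippet with jobConf.setOutputFormat(MyTextOutputFormat.class). But note that the more distinct key prefixes a single reducer sees, the more files (and hence concurrent DFS output streams) it has open at once, which is what seems to be biting you here.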