Harsh, could you briefly explain why the 1M setting for xceivers is bad? The job is working now ... As for ulimit -u, it shows 200703, so is that why the connection is reset by peer? And how come it works with the xceiver modification?
Thanks,
Mark

On Thu, Jan 26, 2012 at 12:21 PM, Harsh J <[email protected]> wrote:

> Agree with Raj V here - your problem should not be the # of transfer
> threads nor the number of open files, given that stacktrace.
>
> And the values you've set for the transfer threads are far beyond the
> recommendations of 4k/8k - I would not recommend doing that. The default
> in 1.0.0 is 256, but set it to 2048/4096, which are good values to have
> when noticing increased HDFS load, or when running services like HBase.
>
> You should instead focus on why it's this particular job (or even a
> particular task, which is important to notice) that fails, and not
> other jobs (or other task attempts).
>
> On Fri, Jan 27, 2012 at 1:10 AM, Raj V <[email protected]> wrote:
> > Mark
> >
> > You have this "Connection reset by peer". Why do you think this
> > problem is related to too many open files?
> >
> > Raj
> >
> > > From: Mark question <[email protected]>
> > > To: [email protected]
> > > Sent: Thursday, January 26, 2012 11:10 AM
> > > Subject: Re: Too many open files Error
> > >
> > > Hi again,
> > > I've tried:
> > >
> > >   <property>
> > >     <name>dfs.datanode.max.xcievers</name>
> > >     <value>1048576</value>
> > >   </property>
> > >
> > > but I'm still getting the same error ... how high can I go??
> > >
> > > Thanks,
> > > Mark
> > >
> > > On Thu, Jan 26, 2012 at 9:29 AM, Mark question <[email protected]> wrote:
> > > > Thanks for the reply.... I have nothing about dfs.datanode.max.xceivers
> > > > in my hdfs-site.xml, so hopefully this would solve the problem. About
> > > > ulimit -n, I'm running on an NFS cluster, so usually I just start
> > > > Hadoop with a single bin/start-all.sh ... Do you think I can add it by
> > > > bin/Datanode -ulimit n ?
> > > >
> > > > Mark
> > > >
> > > > On Thu, Jan 26, 2012 at 7:33 AM, Mapred Learn <[email protected]> wrote:
> > > > > You need to set ulimit -n <bigger value> on the datanodes and
> > > > > restart the datanodes.
> > > > >
> > > > > Sent from my iPhone
> > > > >
> > > > > On Jan 26, 2012, at 6:06 AM, Idris Ali <[email protected]> wrote:
> > > > > > Hi Mark,
> > > > > >
> > > > > > On a lighter note, what is the count of xceivers? The
> > > > > > dfs.datanode.max.xceivers property in hdfs-site.xml?
> > > > > >
> > > > > > Thanks,
> > > > > > -idris
> > > > > >
> > > > > > On Thu, Jan 26, 2012 at 5:28 PM, Michel Segel <[email protected]> wrote:
> > > > > > > Sorry, going from memory...
> > > > > > > As user hadoop or mapred or hdfs, what do you see when you do a
> > > > > > > ulimit -a? That should give you the number of open files allowed
> > > > > > > by a single user...
> > > > > > >
> > > > > > > Sent from a remote device. Please excuse any typos...
> > > > > > >
> > > > > > > Mike Segel
> > > > > > >
> > > > > > > On Jan 26, 2012, at 5:13 AM, Mark question <[email protected]> wrote:
> > > > > > > > Hi guys,
> > > > > > > >
> > > > > > > > I get this error from a job trying to process 3 million records:
> > > > > > > >
> > > > > > > > java.io.IOException: Bad connect ack with firstBadLink 192.168.1.20:50010
> > > > > > > >   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2903)
> > > > > > > >   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826)
> > > > > > > >   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
> > > > > > > >   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)
> > > > > > > >
> > > > > > > > When I checked the log file of datanode-20, I see:
> > > > > > > >
> > > > > > > > 2012-01-26 03:00:11,827 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > > > > > DatanodeRegistration(192.168.1.20:50010, storageID=DS-97608578-192.168.1.20-50010-1327575205369,
> > > > > > > > infoPort=50075, ipcPort=50020):DataXceiver
> > > > > > > > java.io.IOException: Connection reset by peer
> > > > > > > >   at sun.nio.ch.FileDispatcher.read0(Native Method)
> > > > > > > >   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
> > > > > > > >   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
> > > > > > > >   at sun.nio.ch.IOUtil.read(IOUtil.java:175)
> > > > > > > >   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
> > > > > > > >   at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
> > > > > > > >   at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
> > > > > > > >   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> > > > > > > >   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> > > > > > > >   at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> > > > > > > >   at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> > > > > > > >   at java.io.DataInputStream.read(DataInputStream.java:132)
> > > > > > > >   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:262)
> > > > > > > >   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:309)
> > > > > > > >   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:373)
> > > > > > > >   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:525)
> > > > > > > >   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
> > > > > > > >   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
> > > > > > > >   at java.lang.Thread.run(Thread.java:662)
> > > > > > > >
> > > > > > > > This is because I'm running 10 maps per TaskTracker on a 20-node
> > > > > > > > cluster, and each map opens about 300 files, so that should give
> > > > > > > > 6000 open files at the same time ... why is this a problem? The
> > > > > > > > maximum # of files per process on one machine is:
> > > > > > > >
> > > > > > > > cat /proc/sys/fs/file-max ---> 2403545
> > > > > > > >
> > > > > > > > Any suggestions?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Mark
>
> --
> Harsh J
> Customer Ops. Engineer, Cloudera
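For reference, a minimal hdfs-site.xml sketch of the setting Harsh describes above, using the 4096 value he recommends rather than 1048576; the property name is the Hadoop 1.x spelling already used in the thread, the value is only a starting point to tune per cluster, and the datanodes need a restart for it to take effect:

  <property>
    <name>dfs.datanode.max.xcievers</name>
    <!-- 4096 per Harsh's recommendation above; tune per cluster load -->
    <value>4096</value>
  </property>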

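Likewise, a sketch of raising the open-file limit that Mapred Learn and Michel refer to, assuming the DataNode runs as a dedicated Linux user (a hypothetical user "hadoop" below) on hosts that honour /etc/security/limits.conf; the exact mechanism varies by distribution, 32768 is only an example value, and the new limit applies only to sessions started after the change, so the daemons need a restart:

  # check the current per-process open-file limit for the daemon user
  ulimit -n

  # /etc/security/limits.conf (edit as root; example values)
  hadoop  soft  nofile  32768
  hadoop  hard  nofile  32768

  # log in again as that user, verify, then restart the cluster
  ulimit -n
  bin/stop-all.sh && bin/start-all.sh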