Hi Harsh and Idris ... so the only drawback of increasing the value of xcievers is a memory issue? In that case I'll set it to a value that doesn't fill the memory ...

Thanks,
Mark
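For reference, a bounded setting along the lines Harsh recommends further down the thread (2048/4096 rather than 1048576) would look like this in hdfs-site.xml. This is only a sketch using the upper end of that range, not a value tuned for Mark's cluster; the "xcievers" spelling follows the config Mark reported as taking effect.

    <!-- hdfs-site.xml (sketch): bound the DataNode transfer threads to a
         sane value instead of 1M. 4096 is just the upper end of the range
         recommended in this thread, not a measured requirement. -->
    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>

The datanodes need to be restarted for the new limit to take effect.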
On Thu, Jan 26, 2012 at 10:37 PM, Idris Ali <[email protected]> wrote:
> Hi Mark,
>
> As Harsh pointed out, it is not a good idea to increase the xceiver count
> to an arbitrarily high value. I suggested increasing the xceiver count just
> to unblock execution of your program temporarily.
>
> Thanks,
> -Idris
>
> On Fri, Jan 27, 2012 at 10:39 AM, Harsh J <[email protected]> wrote:
>
> > You are technically allowing the DN to run 1 million block transfer
> > (in/out) threads by doing that. It does not take up resources by default,
> > sure, but it can now be abused with requests that make your DN run out of
> > memory and crash, because it is not bound to proper limits anymore.
> >
> > On Fri, Jan 27, 2012 at 5:49 AM, Mark question <[email protected]> wrote:
> > > Harsh, could you explain briefly why the 1M setting for xceivers is
> > > bad? The job is working now ...
> > > About ulimit -u: it shows 200703, so is that why the connection is
> > > reset by peer? How come it's working with the xceiver modification?
> > >
> > > Thanks,
> > > Mark
> > >
> > > On Thu, Jan 26, 2012 at 12:21 PM, Harsh J <[email protected]> wrote:
> > >
> > >> Agree with Raj V here - your problem should not be the # of transfer
> > >> threads nor the number of open files, given that stacktrace.
> > >>
> > >> And the values you've set for the transfer threads are far beyond the
> > >> recommendations of 4k/8k - I would not recommend doing that. The
> > >> default in 1.0.0 is 256, but set it to 2048/4096, which are good
> > >> values to have when noticing increased HDFS load, or when running
> > >> services like HBase.
> > >>
> > >> You should instead focus on why it is this particular job (or even a
> > >> particular task, which is important to notice) that fails, and not
> > >> other jobs (or other task attempts).
> > >>
> > >> On Fri, Jan 27, 2012 at 1:10 AM, Raj V <[email protected]> wrote:
> > >> > Mark
> > >> >
> > >> > You have this "Connection reset by peer". Why do you think this
> > >> > problem is related to too many open files?
> > >> >
> > >> > Raj
> > >> >
> > >> >> ________________________________
> > >> >> From: Mark question <[email protected]>
> > >> >> To: [email protected]
> > >> >> Sent: Thursday, January 26, 2012 11:10 AM
> > >> >> Subject: Re: Too many open files Error
> > >> >>
> > >> >> Hi again,
> > >> >> I've tried:
> > >> >>   <property>
> > >> >>     <name>dfs.datanode.max.xcievers</name>
> > >> >>     <value>1048576</value>
> > >> >>   </property>
> > >> >> but I'm still getting the same error ... how high can I go??
> > >> >>
> > >> >> Thanks,
> > >> >> Mark
> > >> >>
> > >> >> On Thu, Jan 26, 2012 at 9:29 AM, Mark question <[email protected]> wrote:
> > >> >>
> > >> >>> Thanks for the reply ... I have nothing about
> > >> >>> dfs.datanode.max.xceivers in my hdfs-site.xml, so hopefully this
> > >> >>> will solve the problem. About ulimit -n: I'm running on an NFS
> > >> >>> cluster, so usually I just start Hadoop with a single
> > >> >>> bin/start-all.sh ... Do you think I can add it with
> > >> >>> bin/Datanode -ulimit n ?
> > >> >>>
> > >> >>> Mark
> > >> >>>
> > >> >>> On Thu, Jan 26, 2012 at 7:33 AM, Mapred Learn <[email protected]> wrote:
> > >> >>>
> > >> >>>> You need to set ulimit -n <bigger value> on the datanodes and
> > >> >>>> restart the datanodes.
> > >> >>>>
> > >> >>>> Sent from my iPhone
> > >> >>>>
> > >> >>>> On Jan 26, 2012, at 6:06 AM, Idris Ali <[email protected]> wrote:
> > >> >>>>
> > >> >>>> > Hi Mark,
> > >> >>>> >
> > >> >>>> > On a lighter note, what is the count of xceivers? The
> > >> >>>> > dfs.datanode.max.xceivers property in hdfs-site.xml?
> > >> >>>> >
> > >> >>>> > Thanks,
> > >> >>>> > -idris
> > >> >>>> >
> > >> >>>> > On Thu, Jan 26, 2012 at 5:28 PM, Michel Segel <[email protected]> wrote:
> > >> >>>> >
> > >> >>>> >> Sorry, going from memory...
> > >> >>>> >> As user hadoop or mapred or hdfs, what do you see when you do
> > >> >>>> >> a ulimit -a? That should give you the number of open files
> > >> >>>> >> allowed for a single user...
> > >> >>>> >>
> > >> >>>> >> Sent from a remote device. Please excuse any typos...
> > >> >>>> >>
> > >> >>>> >> Mike Segel
> > >> >>>> >>
> > >> >>>> >> On Jan 26, 2012, at 5:13 AM, Mark question <[email protected]> wrote:
> > >> >>>> >>
> > >> >>>> >>> Hi guys,
> > >> >>>> >>>
> > >> >>>> >>> I get this error from a job trying to process 3 million records:
> > >> >>>> >>>
> > >> >>>> >>> java.io.IOException: Bad connect ack with firstBadLink 192.168.1.20:50010
> > >> >>>> >>>   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2903)
> > >> >>>> >>>   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826)
> > >> >>>> >>>   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
> > >> >>>> >>>   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)
> > >> >>>> >>>
> > >> >>>> >>> When I checked the log file of datanode-20, I see:
> > >> >>>> >>>
> > >> >>>> >>> 2012-01-26 03:00:11,827 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.1.20:50010, storageID=DS-97608578-192.168.1.20-50010-1327575205369, infoPort=50075, ipcPort=50020):DataXceiver
> > >> >>>> >>> java.io.IOException: Connection reset by peer
> > >> >>>> >>>   at sun.nio.ch.FileDispatcher.read0(Native Method)
> > >> >>>> >>>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
> > >> >>>> >>>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
> > >> >>>> >>>   at sun.nio.ch.IOUtil.read(IOUtil.java:175)
> > >> >>>> >>>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
> > >> >>>> >>>   at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
> > >> >>>> >>>   at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
> > >> >>>> >>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> > >> >>>> >>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> > >> >>>> >>>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> > >> >>>> >>>   at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> > >> >>>> >>>   at java.io.DataInputStream.read(DataInputStream.java:132)
> > >> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:262)
> > >> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:309)
> > >> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:373)
> > >> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:525)
> > >> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
> > >> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
> > >> >>>> >>>   at java.lang.Thread.run(Thread.java:662)
> > >> >>>> >>>
> > >> >>>> >>> That is because I'm running 10 maps per taskTracker on a
> > >> >>>> >>> 20-node cluster, and each map opens about 300 files, so that
> > >> >>>> >>> should give 6000 open files at the same time ... why is this
> > >> >>>> >>> a problem? The maximum # of files per process on one machine is:
> > >> >>>> >>>
> > >> >>>> >>> cat /proc/sys/fs/file-max ---> 2403545
> > >> >>>> >>>
> > >> >>>> >>> Any suggestions?
> > >> >>>> >>>
> > >> >>>> >>> Thanks,
> > >> >>>> >>> Mark
> > >> >>>> >>
> > >> >>>> >
> > >> >>>
> > >> >>
> > >> >
> > >>
> > >> --
> > >> Harsh J
> > >> Customer Ops. Engineer, Cloudera
> > >
> >
> > --
> > Harsh J
> > Customer Ops. Engineer, Cloudera
>
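As a rough sketch of the ulimit advice in the thread (Michel's ulimit -a check and Mapred Learn's "ulimit -n <bigger value>" followed by a datanode restart), assuming the datanodes run as a user named hadoop, which the thread never actually states, and using 32768 purely as an illustrative value:

    # Run as the user that starts the DataNodes (assumed here to be "hadoop").
    ulimit -a          # full limits report; "open files" is the per-process fd cap
    ulimit -n          # just the open-files limit

    # Optional: count the descriptors a running DataNode currently holds,
    # to compare against the limit (the PID lookup is illustrative).
    DN_PID=$(pgrep -f org.apache.hadoop.hdfs.server.datanode.DataNode | head -n 1)
    ls /proc/"$DN_PID"/fd | wc -l

    # Raise the limit persistently in /etc/security/limits.conf, then restart
    # the datanodes. "hadoop" and 32768 are example values, not from the thread.
    #   hadoop  soft  nofile  32768
    #   hadoop  hard  nofile  32768

Note that /proc/sys/fs/file-max (the 2403545 figure in Mark's original mail) is the system-wide cap, while the limit reported by ulimit -n applies per process/user and is usually far lower, which is why raising it for the Hadoop user matters even when file-max looks huge.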
