Harsh, could you explain briefly why the 1M setting for xceivers is bad? The
job is working now ...
About ulimit: -u shows 200703, so is that why the connection is reset by
peer? And how come it's working after the xceiver modification?
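For reference, this is what I'm comparing on the datanode machine, in case
I'm mixing up the two limits:

    ulimit -u    # max user processes (the 200703 above)
    ulimit -n    # max open file descriptors (the limit "Too many open files" runs into)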

Thanks,
Mark


On Thu, Jan 26, 2012 at 12:21 PM, Harsh J <[email protected]> wrote:

> Agree with Raj V here - Your problem should not be the # of transfer
> threads nor the number of open files given that stacktrace.
>
> And the value you've set for the transfer threads is far beyond the
> recommended 4k/8k - I would not recommend doing that. The default in
> 1.0.0 is 256, but set it to 2048/4096, which are good values to have
> when you notice increased HDFS load or when running services like
> HBase.
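>
> A minimal sketch of that (the 4096 value and the restart commands below
> are illustrative, not prescriptive): in each datanode's
> conf/hdfs-site.xml set
>
>   <property>
>     <name>dfs.datanode.max.xcievers</name>
>     <value>4096</value>
>   </property>
>
> and then restart that datanode:
>
>   bin/hadoop-daemon.sh stop datanode
>   bin/hadoop-daemon.sh start datanode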
>
> You should instead focus on why it is this particular job (or even a
> particular task, which is important to notice) that fails, and not
> other jobs (or other task attempts).
>
> On Fri, Jan 27, 2012 at 1:10 AM, Raj V <[email protected]> wrote:
> > Mark
> >
> > You have this "Connection reset by peer". Why do you think this problem
> > is related to too many open files?
> >
> > Raj
> >
> >
> >
> >>________________________________
> >> From: Mark question <[email protected]>
> >>To: [email protected]
> >>Sent: Thursday, January 26, 2012 11:10 AM
> >>Subject: Re: Too many open files Error
> >>
> >>Hi again,
> >>I've tried:
> >>     <property>
> >>        <name>dfs.datanode.max.xcievers</name>
> >>        <value>1048576</value>
> >>      </property>
> >>but I'm still getting the same error ... how high can I go??
> >>
> >>Thanks,
> >>Mark
> >>
> >>
> >>
> >>On Thu, Jan 26, 2012 at 9:29 AM, Mark question <[email protected]> wrote:
> >>
> >>> Thanks for the reply... I have nothing about dfs.datanode.max.xceivers
> >>> in my hdfs-site.xml, so hopefully this will solve the problem. About
> >>> ulimit -n: I'm running on an NFS cluster, so usually I just start Hadoop
> >>> with a single bin/start-all.sh ... Do you think I can add it via
> >>> bin/Datanode -ulimit n ?
> >>>
> >>> Mark
> >>>
> >>>
> >>> On Thu, Jan 26, 2012 at 7:33 AM, Mapred Learn <[email protected]> wrote:
> >>>
> >>>> You need to set ulimit -n <bigger value> on the datanodes and restart
> >>>> the datanodes.
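> >>>>
> >>>> One rough sketch of doing that (the "hadoop" user name and the 32768
> >>>> value are only examples): raise the nofile limit for the daemon user in
> >>>> /etc/security/limits.conf on each datanode, log in again, then restart
> >>>> the datanode:
> >>>>
> >>>>   # /etc/security/limits.conf
> >>>>   hadoop  soft  nofile  32768
> >>>>   hadoop  hard  nofile  32768
> >>>>
> >>>>   # as the daemon user, verify and bounce the datanode
> >>>>   ulimit -n
> >>>>   bin/hadoop-daemon.sh stop datanode
> >>>>   bin/hadoop-daemon.sh start datanode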
> >>>>
> >>>> Sent from my iPhone
> >>>>
> >>>> On Jan 26, 2012, at 6:06 AM, Idris Ali <[email protected]> wrote:
> >>>>
> >>>> > Hi Mark,
> >>>> >
> >>>> > On a lighter note, what is the count of xceivers, i.e. the
> >>>> > dfs.datanode.max.xceivers property in hdfs-site.xml?
> >>>> >
> >>>> > Thanks,
> >>>> > -idris
> >>>> >
> >>>> > On Thu, Jan 26, 2012 at 5:28 PM, Michel Segel <[email protected]> wrote:
> >>>> >
> >>>> >> Sorry, going from memory...
> >>>> >> As the hadoop, mapred, or hdfs user, what do you see when you do a
> >>>> >> ulimit -a?
> >>>> >> That should give you the number of open files allowed for a single
> >>>> >> user...
> >>>> >>
> >>>> >>
> >>>> >> Sent from a remote device. Please excuse any typos...
> >>>> >>
> >>>> >> Mike Segel
> >>>> >>
> >>>> >> On Jan 26, 2012, at 5:13 AM, Mark question <[email protected]> wrote:
> >>>> >>
> >>>> >>> Hi guys,
> >>>> >>>
> >>>> >>> I get this error from a job trying to process 3 million records.
> >>>> >>>
> >>>> >>> java.io.IOException: Bad connect ack with firstBadLink 192.168.1.20:50010
> >>>> >>>   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2903)
> >>>> >>>   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826)
> >>>> >>>   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
> >>>> >>>   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)
> >>>> >>>
> >>>> >>> When I checked the log file of datanode-20, I saw:
> >>>> >>>
> >>>> >>> 2012-01-26 03:00:11,827 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> >>>> >>> DatanodeRegistration(192.168.1.20:50010, storageID=DS-97608578-192.168.1.20-50010-1327575205369, infoPort=50075, ipcPort=50020):DataXceiver
> >>>> >>> java.io.IOException: Connection reset by peer
> >>>> >>>   at sun.nio.ch.FileDispatcher.read0(Native Method)
> >>>> >>>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
> >>>> >>>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
> >>>> >>>   at sun.nio.ch.IOUtil.read(IOUtil.java:175)
> >>>> >>>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
> >>>> >>>   at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
> >>>> >>>   at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
> >>>> >>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> >>>> >>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> >>>> >>>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> >>>> >>>   at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> >>>> >>>   at java.io.DataInputStream.read(DataInputStream.java:132)
> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:262)
> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:309)
> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:373)
> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:525)
> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
> >>>> >>>   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
> >>>> >>>   at java.lang.Thread.run(Thread.java:662)
> >>>> >>>
> >>>> >>>
> >>>> >>> This is because I'm running 10 maps per taskTracker on a 20-node
> >>>> >>> cluster; each map opens about 300 files, so that should give 6000
> >>>> >>> open files at the same time ... why is this a problem? The maximum
> >>>> >>> number of files per process on one machine is:
> >>>> >>>
> >>>> >>> cat /proc/sys/fs/file-max   ---> 2403545
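> >>>> >>>
> >>>> >>> To compare that system-wide ceiling with the per-process limit the
> >>>> >>> datanode actually runs under (the jps/awk lookup below is just one
> >>>> >>> illustrative way to find the DataNode pid):
> >>>> >>>
> >>>> >>> ulimit -n                                # per-process limit for my shell/user
> >>>> >>> DN_PID=$(jps | awk '/DataNode/ {print $1}')
> >>>> >>> grep 'open files' /proc/$DN_PID/limits   # limit of the running datanode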
> >>>> >>>
> >>>> >>>
> >>>> >>> Any suggestions?
> >>>> >>>
> >>>> >>> Thanks,
> >>>> >>> Mark
> >>>> >>
> >>>>
> >>>
> >>>
> >>
> >>
> >>
>
>
>
> --
> Harsh J
> Customer Ops. Engineer, Cloudera
>
