By doing that you are technically allowing the DN to run up to 1 million block transfer (in/out) threads. Sure, the threads cost nothing while idle, but the setting no longer acts as a real limit: a flood of requests can now drive your DN out of memory and crash it, because it isn't bound by a sensible cap anymore.
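If it helps, a bounded setting would look roughly like this in hdfs-site.xml (the 4096 here is only an illustration taken from the 2048/4096 range I mention further down; tune it to your actual load and restart the DNs afterwards):

  <property>
    <!-- Cap on concurrent block transfer (xceiver) threads per DataNode.
         4096 is an example value, not a hard recommendation. -->
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>

If you then still hit "Too many open files", raise the nofile ulimit for the user that runs the DataNode (for example via /etc/security/limits.conf) and restart it, as others suggested earlier in this thread, instead of pushing the xceiver cap higher.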
On Fri, Jan 27, 2012 at 5:49 AM, Mark question <[email protected]> wrote:
> Harsh, could you explain briefly why is 1M setting for xceiver is bad? the
> job is working now ...
> about the ulimit -u it shows 200703, so is that why connection is reset by
> peer? How come it's working with the xceiver modification?
>
> Thanks,
> Mark
>
>
> On Thu, Jan 26, 2012 at 12:21 PM, Harsh J <[email protected]> wrote:
>
>> Agree with Raj V here - Your problem should not be the # of transfer
>> threads nor the number of open files given that stacktrace.
>>
>> And the values you've set for the transfer threads are far beyond
>> recommendations of 4k/8k - I would not recommend doing that. Default
>> in 1.0.0 is 256 but set it to 2048/4096, which are good value to have
>> when noticing increased HDFS load, or when running services like
>> HBase.
>>
>> You should instead focus on why its this particular job (or even
>> particular task, which is important to notice) that fails, and not
>> other jobs (or other task attempts).
>>
>> On Fri, Jan 27, 2012 at 1:10 AM, Raj V <[email protected]> wrote:
>> > Mark
>> >
>> > You have this "Connection reset by peer". Why do you think this
>> > problem is related to too many open files?
>> >
>> > Raj
>> >
>> >
>> >>________________________________
>> >> From: Mark question <[email protected]>
>> >> To: [email protected]
>> >> Sent: Thursday, January 26, 2012 11:10 AM
>> >> Subject: Re: Too many open files Error
>> >>
>> >> Hi again,
>> >> I've tried :
>> >>   <property>
>> >>     <name>dfs.datanode.max.xcievers</name>
>> >>     <value>1048576</value>
>> >>   </property>
>> >> but I'm still getting the same error ... how high can I go??
>> >>
>> >> Thanks,
>> >> Mark
>> >>
>> >>
>> >> On Thu, Jan 26, 2012 at 9:29 AM, Mark question <[email protected]> wrote:
>> >>
>> >>> Thanks for the reply.... I have nothing about dfs.datanode.max.xceivers
>> >>> on my hdfs-site.xml so hopefully this would solve the problem and about
>> >>> the ulimit -n , I'm running on an NFS cluster, so usually I just start
>> >>> Hadoop with a single bin/start-all.sh ... Do you think I can add it by
>> >>> bin/Datanode -ulimit n ?
>> >>>
>> >>> Mark
>> >>>
>> >>>
>> >>> On Thu, Jan 26, 2012 at 7:33 AM, Mapred Learn <[email protected]> wrote:
>> >>>
>> >>>> U need to set ulimit -n <bigger value> on datanode and restart
>> >>>> datanodes.
>> >>>>
>> >>>> Sent from my iPhone
>> >>>>
>> >>>> On Jan 26, 2012, at 6:06 AM, Idris Ali <[email protected]> wrote:
>> >>>>
>> >>>> > Hi Mark,
>> >>>> >
>> >>>> > On a lighter note what is the count of xceivers?
>> >>>> > dfs.datanode.max.xceivers property in hdfs-site.xml?
>> >>>> >
>> >>>> > Thanks,
>> >>>> > -idris
>> >>>> >
>> >>>> > On Thu, Jan 26, 2012 at 5:28 PM, Michel Segel <[email protected]> wrote:
>> >>>> >
>> >>>> >> Sorry going from memory...
>> >>>> >> As user Hadoop or mapred or hdfs what do you see when you do a
>> >>>> >> ulimit -a?
>> >>>> >> That should give you the number of open files allowed by a single
>> >>>> >> user...
>> >>>> >>
>> >>>> >>
>> >>>> >> Sent from a remote device. Please excuse any typos...
>> >>>> >>
>> >>>> >> Mike Segel
>> >>>> >>
>> >>>> >> On Jan 26, 2012, at 5:13 AM, Mark question <[email protected]> wrote:
>> >>>> >>
>> >>>> >>> Hi guys,
>> >>>> >>>
>> >>>> >>> I get this error from a job trying to process 3Million records.
>> >>>> >>>
>> >>>> >>> java.io.IOException: Bad connect ack with firstBadLink 192.168.1.20:50010
>> >>>> >>> at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2903)
>> >>>> >>> at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826)
>> >>>> >>> at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
>> >>>> >>> at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)
>> >>>> >>>
>> >>>> >>> When I checked the logfile of the datanode-20, I see :
>> >>>> >>>
>> >>>> >>> 2012-01-26 03:00:11,827 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.1.20:50010, storageID=DS-97608578-192.168.1.20-50010-1327575205369, infoPort=50075, ipcPort=50020):DataXceiver
>> >>>> >>> java.io.IOException: Connection reset by peer
>> >>>> >>> at sun.nio.ch.FileDispatcher.read0(Native Method)
>> >>>> >>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>> >>>> >>> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
>> >>>> >>> at sun.nio.ch.IOUtil.read(IOUtil.java:175)
>> >>>> >>> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
>> >>>> >>> at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>> >>>> >>> at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>> >>>> >>> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>> >>>> >>> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>> >>>> >>> at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>> >>>> >>> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>> >>>> >>> at java.io.DataInputStream.read(DataInputStream.java:132)
>> >>>> >>> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:262)
>> >>>> >>> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:309)
>> >>>> >>> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:373)
>> >>>> >>> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:525)
>> >>>> >>> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
>> >>>> >>> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
>> >>>> >>> at java.lang.Thread.run(Thread.java:662)
>> >>>> >>>
>> >>>> >>>
>> >>>> >>> Which is because I'm running 10 maps per taskTracker on a 20 node
>> >>>> >>> cluster, each map opens about 300 files so that should give 6000
>> >>>> >>> opened files at the same time ... why is this a problem? the
>> >>>> >>> maximum # of files per process on one machine is:
>> >>>> >>>
>> >>>> >>> cat /proc/sys/fs/file-max ---> 2403545
>> >>>> >>>
>> >>>> >>> Any suggestions?
>> >>>> >>>
>> >>>> >>> Thanks,
>> >>>> >>> Mark
>> >>>> >>
>> >>>
>> >>
>>
>> --
>> Harsh J
>> Customer Ops. Engineer, Cloudera

--
Harsh J
Customer Ops. Engineer, Cloudera
