> Can you check the DN logs for "exceeds the limit of concurrent
> xcievers"? You may need to bump the dfs.datanode.max.xcievers
> parameter in hdfs-site.xml, and also possibly the nofile ulimit.
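For reference, the xciever ceiling Todd mentions lives in hdfs-site.xml on each DataNode (the property name really is spelled "xcievers" in this Hadoop version). A minimal sketch; the value of 4096 is illustrative, a common choice for CDH3-era clusters since the 0.20 default of 256 is easy to exhaust:

    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>

The DataNodes need a restart to pick this up, and since each xceiver thread holds open files, the hdfs user's nofile ulimit (typically raised in /etc/security/limits.conf) should be at least as large as this value.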
Thanks Todd, and sorry for the late reply - I missed this message. I didn't see any xciever messages in the DN logs, but I figured it might be a good idea to up the nofile ulimit anyway. The result is a jsvc that is eating memory:

$ top
Mem:  16320412k total, 16199036k used,   121376k free,    25412k buffers
Swap: 33554424k total,   291492k used, 33262932k free, 10966732k cached

  PID USER   PR  NI VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
24835 mapred 18   0 2644m 157m 8316 S 34.1  1.0  7031:27 java
14794 hdfs   18   0 2430m 1.5g  10m S  3.3  9.8  3:39.56 jsvc

I'll revert it and see what effect dfs.datanode.max.xcievers will have. (A quick way to check the DataNode's actual descriptor usage is sketched after the quoted thread below.)

Cheers,
Evert

>
> -Todd
>
>
> On Wed, Mar 9, 2011 at 3:27 AM, Evert Lammerts <[email protected]> wrote:
> > We see a lot of IOExceptions coming from HDFS during a job that does
> > nothing but untar 100 files (1 per Mapper, sizes vary between 5GB and
> > 80GB) that are in HDFS, to HDFS. DataNodes are also showing Exceptions
> > that I think are related. (See stacktraces below.)
> >
> > This job should not be able to overload the system, I think... I
> > realize that much data needs to go over the lines, but HDFS should
> > still be responsive. Any ideas / help is much appreciated!
> >
> > Some details:
> > * Hadoop 0.20.2 (CDH3b4)
> > * 5 node cluster plus 1 node for JT/NN (Sun Thumpers)
> > * 4 cores/node, 4GB RAM/core
> > * CentOS 5.5
> >
> > Job output:
> >
> > java.io.IOException: java.io.IOException: Could not obtain block: blk_-3695352030358969086_130839 file=/user/emeij/icwsm-data-test/01-26-SOCIAL_MEDIA.tar.gz
> >         at ilps.DownloadICWSM$UntarMapper.map(DownloadICWSM.java:449)
> >         at ilps.DownloadICWSM$UntarMapper.map(DownloadICWSM.java:1)
> >         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> >         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:390)
> >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
> >         at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
> >         at java.security.AccessController.doPrivileged(Native Method)
> >         at javax.security.auth.Subject.doAs(Subject.java:396)
> >         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
> >         at org.apache.hadoop.mapred.Child.main(Child.java:234)
> > Caused by: java.io.IOException: Could not obtain block: blk_-3695352030358969086_130839 file=/user/emeij/icwsm-data-test/01-26-SOCIAL_MEDIA.tar.gz
> >         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1977)
> >         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1784)
> >         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1932)
> >         at java.io.DataInputStream.read(DataInputStream.java:83)
> >         at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:55)
> >         at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:74)
> >         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:335)
> >         at ilps.DownloadICWSM$CopyThread.run(DownloadICWSM.java:149)
> >
> > Example DataNode Exceptions (note that these come from the node at 192.168.28.211):
> >
> > 2011-03-08 19:40:40,297 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-9222067946733189014_3798233 java.io.EOFException: while trying to read 3067064 bytes
> > 2011-03-08 19:40:41,018 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.211:50050, dest: /192.168.28.211:49748, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201103071120_0030_m_000032_0, offset: 3072, srvID:
> > DS-568746059-145.100.2.180-50050-1291128670510, blockid: blk_3596618013242149887_4060598, duration: 2632000
> > 2011-03-08 19:40:41,049 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-9221028436071074510_2325937 java.io.EOFException: while trying to read 2206400 bytes
> > 2011-03-08 19:40:41,348 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-9221549395563181322_4024529 java.io.EOFException: while trying to read 3037288 bytes
> > 2011-03-08 19:40:41,357 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-9221885906633018147_3895876 java.io.EOFException: while trying to read 1981952 bytes
> > 2011-03-08 19:40:41,434 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Block blk_-9221885906633018147_3895876 unfinalized and removed.
> > 2011-03-08 19:40:41,434 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-9221885906633018147_3895876 received exception java.io.EOFException: while trying to read 1981952 bytes
> > 2011-03-08 19:40:41,434 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.28.211:50050, storageID=DS-568746059-145.100.2.180-50050-1291128670510, infoPort=50075, ipcPort=50020):DataXceiver
> > java.io.EOFException: while trying to read 1981952 bytes
> >         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:270)
> >         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357)
> >         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378)
> >         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534)
> >         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
> >         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
> > 2011-03-08 19:40:41,465 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Block blk_-9221549395563181322_4024529 unfinalized and removed.
> > 2011-03-08 19:40:41,466 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-9221549395563181322_4024529 received exception java.io.EOFException: while trying to read 3037288 bytes
> > 2011-03-08 19:40:41,466 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.28.211:50050, storageID=DS-568746059-145.100.2.180-50050-1291128670510, infoPort=50075, ipcPort=50020):DataXceiver
> > java.io.EOFException: while trying to read 3037288 bytes
> >         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:270)
> >         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357)
> >         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378)
> >         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534)
> >         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
> >         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
> >
> > Cheers,
> >
> > Evert Lammerts
> > Consultant eScience & Cloud Services
> > SARA Computing & Network Services
> > Operations, Support & Development
> >
> > Phone: +31 20 888 4101
> > Email: [email protected]
> > http://www.sara.nl
> >
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
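As promised above, a quick way to check how many file descriptors the running DataNode actually holds and what nofile limit the hdfs user gets. This is a minimal sketch, run as root; the pgrep pattern is an assumption for a CDH-style jsvc-launched DataNode, and CentOS 5 kernels predate /proc/<pid>/limits, hence the su-based check:

    # PID of the jsvc-launched DataNode (pattern is a guess; adjust to your process list)
    DN_PID=$(pgrep -u hdfs jsvc | head -n 1)

    # File descriptors currently held by the DataNode process
    ls /proc/$DN_PID/fd | wc -l

    # nofile limit the hdfs user gets on login (set in /etc/security/limits.conf);
    # note this is the limit a *new* login would get, not necessarily the one the
    # already-running daemon was started with
    su -s /bin/sh hdfs -c 'ulimit -n'

If the descriptor count sits well below the limit and the "exceeds the limit of concurrent xcievers" message never appears, the bottleneck is plausibly elsewhere; the EOFExceptions in receiveBlock above, for instance, typically indicate clients dropping connections mid-write rather than the DataNode running out of descriptors.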
