The lack of log messages for 3 hours reminds me of an odd OS-level failure we would see on some machines. The underlying host file system would get into a deadlocked state, and the Hadoop processes would hang while attempting to write a log message. The first noticeable symptom was that the affected machines had multiple instances of updatedb running (the once-per-day scan of the file system that primes the locate command's cache). The root cause was not resolved by the time I left, but the monitoring was modified to catch the failure earlier.
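For what it's worth, that symptom is easy to check for mechanically. Below is a minimal sketch, in Python, of the kind of host check that was added. It assumes a Linux box where the daily scan shows up in the process table as "updatedb" and that pgrep is available; the threshold and exit codes are illustrative, not the exact check we ran.

#!/usr/bin/env python
# Minimal sketch of a host check for the stuck-filesystem failure mode:
# if more than one updatedb instance is alive at once, the once-per-day
# locate scan is piling up behind a hung file system, so flag the host.
# Assumptions: Linux, pgrep available, process is named "updatedb".
import subprocess
import sys

def count_updatedb_processes():
    # pgrep -c prints the number of matching processes; -x matches the
    # process name exactly. pgrep exits non-zero when nothing matches,
    # which check_output surfaces as CalledProcessError -- treat as 0.
    try:
        out = subprocess.check_output(["pgrep", "-cx", "updatedb"])
        return int(out.strip())
    except subprocess.CalledProcessError:
        return 0

if __name__ == "__main__":
    n = count_updatedb_processes()
    if n > 1:
        # More than one daily run alive at once means earlier runs never
        # finished -- a strong hint the underlying file system is hung.
        print("WARNING: %d updatedb instances running; "
              "file system may be deadlocked" % n)
        sys.exit(2)  # nonzero so a cron/Nagios-style wrapper can alert
    sys.exit(0)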
Sagar, did this ever get resolved?

On Wed, Apr 15, 2009 at 12:45 AM, Rakhi Khatwani <[email protected]> wrote:

> Hi,
>
> I was running a mapreduce job which takes data from the table
> ContentTable, processes it, and stores the results into another table.
> My mapreduce program had 20 maps, out of which 19 completed
> successfully. The last map, however, took ages to complete... after 10
> hrs we had to kill the task (at 15-Apr-2009 04:59:39 (10hrs, 30mins,
> 3sec)).
>
> Here are the regionserver logs around that time, and it's really
> weird... there were no logs for 3 hrs! :(
>
> 2009-04-15 02:21:43,417 WARN
> org.apache.hadoop.hbase.regionserver.HRegionServer: Failed major
> compaction check on
> ContentTable,http://www.dnaindia.com/report.asp?newsid=1243858,1239719376495
> java.io.IOException: Filesystem closed
>         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:198)
>         at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:567)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:226)
>         at org.apache.hadoop.hbase.regionserver.HStore.getLowestTimestamp(HStore.java:785)
>         at org.apache.hadoop.hbase.regionserver.HStore.isMajorCompaction(HStore.java:988)
>         at org.apache.hadoop.hbase.regionserver.HStore.isMajorCompaction(HStore.java:976)
>         at org.apache.hadoop.hbase.regionserver.HRegion.isMajorCompaction(HRegion.java:2585)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker.chore(HRegionServer.java:843)
>         at org.apache.hadoop.hbase.Chore.run(Chore.java:65)
> 2009-04-15 02:21:43,417 WARN
> org.apache.hadoop.hbase.regionserver.HRegionServer: Failed major
> compaction check on
> ContentTable,http://www.cnbc.com//id/29864724,1239692396718
> java.io.IOException: Filesystem closed
>         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:198)
>         at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:567)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:226)
>         at org.apache.hadoop.hbase.regionserver.HStore.getLowestTimestamp(HStore.java:785)
>         at org.apache.hadoop.hbase.regionserver.HStore.isMajorCompaction(HStore.java:988)
>         at org.apache.hadoop.hbase.regionserver.HStore.isMajorCompaction(HStore.java:976)
>         at org.apache.hadoop.hbase.regionserver.HRegion.isMajorCompaction(HRegion.java:2585)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker.chore(HRegionServer.java:843)
>         at org.apache.hadoop.hbase.Chore.run(Chore.java:65)
> 2009-04-15 05:08:23,414 WARN
> org.apache.hadoop.hbase.regionserver.HRegionServer: Failed major
> compaction check on
> ContentTable,http://blog.taragana.com/n/lovelorn-fiza-to-act-in-desh-drohi-sequel-24445/,1239692371324
> java.io.IOException: Filesystem closed
>         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:198)
>         at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:567)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:226)
>         at org.apache.hadoop.hbase.regionserver.HStore.getLowestTimestamp(HStore.java:785)
>         at org.apache.hadoop.hbase.regionserver.HStore.isMajorCompaction(HStore.java:988)
>         at org.apache.hadoop.hbase.regionserver.HStore.isMajorCompaction(HStore.java:976)
>         at org.apache.hadoop.hbase.regionserver.HRegion.isMajorCompaction(HRegion.java:2585)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker.chore(HRegionServer.java:843)
>         at org.apache.hadoop.hbase.Chore.run(Chore.java:65)
> 2009-04-15 05:08:23,414 WARN
> org.apache.hadoop.hbase.regionserver.HRegionServer: Failed major
> compaction check on
> ContentTable,http://www.modernghana.com/news/208936/1/past-present-and-future-of-the-indian-national-con.html,1239718472792
> java.io.IOException: Filesystem closed
>         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:198)
>         at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:567)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:226)
>         at org.apache.hadoop.hbase.regionserver.HStore.getLowestTimestamp(HStore.java:785)
>         at org.apache.hadoop.hbase.regionserver.HStore.isMajorCompaction(HStore.java:988)
>         at org.apache.hadoop.hbase.regionserver.HStore.isMajorCompaction(HStore.java:976)
>         at org.apache.hadoop.hbase.regionserver.HRegion.isMajorCompaction(HRegion.java:2585)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker.chore(HRegionServer.java:843)
>         at org.apache.hadoop.hbase.Chore.run(Chore.java:65)
>
> But still, the entire log is filled with this warning! Is it serious,
> or can it be ignored?
>
> The datanode logs are fine up till 2009-04-15 05:07:12, where I get the
> following exception:
>
> 2009-04-15 05:07:12,093 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> blk_-1660273199073776411_91663 received exception java.io.IOException:
> Block blk_-1660273199073776411_91663 is valid, and cannot be written to.
> 2009-04-15 05:07:12,093 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.255.127.31:50010,
> storageID=DS-1366610166-10.255.127.31-50010-1239371098677,
> infoPort=50075, ipcPort=50020):DataXceiver
> java.io.IOException: Block blk_-1660273199073776411_91663 is valid, and
> cannot be written to.
>         at org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(FSDataset.java:958)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:98)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:258)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
>         at java.lang.Thread.run(Thread.java:619)
> 2009-04-15 05:07:13,671 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.255.127.31:50010,
> storageID=DS-1366610166-10.255.127.31-50010-1239371098677,
> infoPort=50075, ipcPort=50020) Starting thread to transfer block
> blk_5200295531482229843_91665 to 10.254.22.255:50010
> 2009-04-15 05:07:13,672 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.255.127.31:50010,
> storageID=DS-1366610166-10.255.127.31-50010-1239371098677,
> infoPort=50075, ipcPort=50020) Starting thread to transfer block
> blk_-1660273199073776411_91663 to 10.255.107.224:50010
> 2009-04-15 05:07:14,161 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.255.127.31:50010,
> storageID=DS-1366610166-10.255.127.31-50010-1239371098677,
> infoPort=50075, ipcPort=50020):Transmitted block
> blk_5200295531482229843_91665 to /10.254.22.255:50010
>
> And I have set dataxceivers to 2048.
>
> What could be the issue?
>
> Thanks,
> Raakhi

--
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
