The absence of log messages for 3 hours reminds me of an odd OS-level failure
that would happen on some machines.
The underlying host file system would get into a deadlocked state, and the
Hadoop processes would hang the moment they attempted to write a log message.
The first noticeable symptom was that the machines had multiple instances of
updatedb running (the once-per-day scan of the file system that primes the
locate command's cache).
This was not resolved by the time I left; instead, the monitoring was modified
to catch the failure earlier.

Sagar, did this ever get resolved?


On Wed, Apr 15, 2009 at 12:45 AM, Rakhi Khatwani
<[email protected]> wrote:

> Hi,
>
> I was running a mapreduce job which takes data from the table ContentTable,
> processes it, and stores the results into another table.
> My mapreduce program had 20 maps, of which 19 completed successfully;
> the last map, however, took ages to complete. After 10 hours we had to kill
> the task (at 15-Apr-2009 04:59:39 (10hrs, 30mins, 3sec)).
>
>
> Here are the regionserver logs around that time, and it's really weird:
> there were no logs for 3 hrs! :(
>
> 2009-04-15 02:21:43,417 WARN
> org.apache.hadoop.hbase.regionserver.HRegionServer: Failed major compaction
> check on ContentTable,
> http://www.dnaindia.com/report.asp?newsid=1243858,1239719376495
> java.io.IOException: Filesystem closed
>        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:198)
>        at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:567)
>        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:226)
>        at org.apache.hadoop.hbase.regionserver.HStore.getLowestTimestamp(HStore.java:785)
>        at org.apache.hadoop.hbase.regionserver.HStore.isMajorCompaction(HStore.java:988)
>        at org.apache.hadoop.hbase.regionserver.HStore.isMajorCompaction(HStore.java:976)
>        at org.apache.hadoop.hbase.regionserver.HRegion.isMajorCompaction(HRegion.java:2585)
>        at org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker.chore(HRegionServer.java:843)
>        at org.apache.hadoop.hbase.Chore.run(Chore.java:65)
> 2009-04-15 02:21:43,417 WARN
> org.apache.hadoop.hbase.regionserver.HRegionServer: Failed major compaction
> check on ContentTable,http://www.cnbc.com//id/29864724,1239692396718
> java.io.IOException: Filesystem closed
>        [same stack trace as above]
> 2009-04-15 05:08:23,414 WARN
> org.apache.hadoop.hbase.regionserver.HRegionServer: Failed major compaction
> check on ContentTable,
> http://blog.taragana.com/n/lovelorn-fiza-to-act-in-desh-drohi-sequel-24445/,1239692371324
> java.io.IOException: Filesystem closed
>        [same stack trace as above]
> 2009-04-15 05:08:23,414 WARN
> org.apache.hadoop.hbase.regionserver.HRegionServer: Failed major compaction
> check on ContentTable,
> http://www.modernghana.com/news/208936/1/past-present-and-future-of-the-indian-national-con.html,1239718472792
> java.io.IOException: Filesystem closed
>        [same stack trace as above]
>
>
> Still, the entire log is filled with this warning! Is it serious, or can it
> be ignored?
>
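
One common way to end up with "Filesystem closed" is for some code in the same
JVM to call close() on the FileSystem handle: FileSystem.get(conf) returns a
cached instance shared process-wide, so whoever closes it breaks every other
holder of that reference, and each later call dies in DFSClient.checkOpen()
with exactly this exception. A minimal sketch of the anti-pattern (the path is
illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FilesystemClosedDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Both calls return the SAME cached object.
            FileSystem fs1 = FileSystem.get(conf);
            FileSystem fs2 = FileSystem.get(conf);

            fs1.listStatus(new Path("/")); // fine
            fs1.close();                   // closes the shared instance

            // fs2 still points at the closed object, so this throws
            // java.io.IOException: Filesystem closed
            fs2.listStatus(new Path("/"));
        }
    }

That said, if nothing in your code touches the FileSystem directly, the closed
handle may just be fallout from whatever wedged the regionserver for those
three hours.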
>
> The datanode logs are fine up until 2009-04-15 05:07:12, where I get the
> following exception.
>
> 2009-04-15 05:07:12,093 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> blk_-1660273199073776411_91663 received exception java.io.IOException:
> Block blk_-1660273199073776411_91663 is valid, and cannot be written to.
> 2009-04-15 05:07:12,093 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.255.127.31:50010,
> storageID=DS-1366610166-10.255.127.31-50010-1239371098677, infoPort=50075,
> ipcPort=50020):DataXceiver
> java.io.IOException: Block blk_-1660273199073776411_91663 is valid, and
> cannot be written to.
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(FSDataset.java:958)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:98)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:258)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
>        at java.lang.Thread.run(Thread.java:619)
> 2009-04-15 05:07:13,671 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.255.127.31:50010,
> storageID=DS-1366610166-10.255.127.31-50010-1239371098677, infoPort=50075,
> ipcPort=50020) Starting thread to transfer block
> blk_5200295531482229843_91665 to 10.254.22.255:50010
> 2009-04-15 05:07:13,672 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.255.127.31:50010,
> storageID=DS-1366610166-10.255.127.31-50010-1239371098677, infoPort=50075,
> ipcPort=50020) Starting thread to transfer block
> blk_-1660273199073776411_91663 to 10.255.107.224:50010
> 2009-04-15 05:07:14,161 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.255.127.31:50010,
> storageID=DS-1366610166-10.255.127.31-50010-1239371098677, infoPort=50075,
> ipcPort=50020):Transmitted block blk_5200295531482229843_91665 to /10.254.22.255:50010
>
> And I have set dataxceivers to 2048.
>
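
One thing worth double-checking there: the property the datanode actually reads
carries a historical misspelling, dfs.datanode.max.xcievers ("xcievers", not
"xceivers"). If the value went in under a differently spelled key, it is
silently ignored. Normally it lives in the datanode's config file; set
programmatically here just to pin down the exact string:

    import org.apache.hadoop.conf.Configuration;

    public class XceiverLimit {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Note the spelling: the misspelled key is the real one.
            conf.setInt("dfs.datanode.max.xcievers", 2048);
            System.out.println(conf.getInt("dfs.datanode.max.xcievers", -1));
        }
    }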
> What could be the issue?
>
> Thanks
> Raakhi
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
