[jira] Commented: (HBASE-1439) race between master and regionserver after missed heartbeat

Jean-Daniel Cryans (JIRA) Tue, 07 Jul 2009 07:24:47 -0700

    [ https://issues.apache.org/jira/browse/HBASE-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12728121#action_12728121 ]


Jean-Daniel Cryans commented on HBASE-1439:
-------------------------------------------

Since both the RSs and the Master (with HBASE-1575) will handle session 
expiration, the possibility that a region server deletes its hlogs after 
sleeping for too long is greatly lessened, as it will just abort and restart. 
The other possibility, as I described in the previous post, is that the region 
server cannot reach any ZK server for whatever reason and gets declared dead by 
the Master, which can still reach a ZK server. When I first tested that, it 
seemed like a very serious issue, but now that I have given it some thought, I 
think that if it happened, HDFS would probably be wedged by the same network 
partition too. 

Anyway, the fix would be to have a timer in a generic Watcher specialization 
that catches all disconnections and, after ~tickTime*1.5 without a 
reconnection, sends a fabricated session expiration. IMO this is a 
nice-to-have for 0.20.0. Should we punt this?
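
To make the timer idea concrete, here is a rough sketch of just the timing 
logic, with no ZooKeeper dependency. Everything in it is hypothetical: the 
class name DisconnectTimer, its method names, and the callback are made up for 
illustration and are not existing HBase or ZooKeeper code. It assumes tickTime 
is passed in as milliseconds and that the caller decides what a "fabricated 
session expiration" actually does (e.g. abort and restart).

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical helper (not part of HBase or ZooKeeper): starts a timer when a
 * disconnection is seen and, if no reconnection arrives within
 * tickTime * 1.5, runs a callback that stands in for a session expiration.
 */
class DisconnectTimer {
    private final long graceMillis;
    private final Runnable onFabricatedExpiry;
    private final ScheduledExecutorService scheduler;
    private ScheduledFuture<?> pending; // non-null while a timer is armed
    private long generation;            // invalidates timers that lost a race

    DisconnectTimer(long tickTimeMillis, Runnable onFabricatedExpiry) {
        this.graceMillis = tickTimeMillis * 3 / 2; // ~tickTime * 1.5
        this.onFabricatedExpiry = onFabricatedExpiry;
        this.scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "disconnect-timer");
            t.setDaemon(true); // do not keep the JVM alive for this timer
            return t;
        });
    }

    /** Call when the connection is lost; arms the timer if not already armed. */
    synchronized void disconnected() {
        if (pending != null) {
            return;
        }
        final long gen = ++generation;
        pending = scheduler.schedule(() -> fire(gen), graceMillis, TimeUnit.MILLISECONDS);
    }

    /** Call when the connection is re-established; disarms the timer. */
    synchronized void reconnected() {
        generation++;
        if (pending != null) {
            pending.cancel(false);
            pending = null;
        }
    }

    private void fire(long gen) {
        synchronized (this) {
            if (gen != generation) {
                return; // a reconnection happened after this timer was armed
            }
            pending = null;
        }
        onFabricatedExpiry.run();
    }

    /** Stops the timer thread. */
    void close() {
        scheduler.shutdownNow();
    }
}
```

In a real integration, I would expect a ZooKeeper Watcher's process() method to 
call disconnected() on KeeperState.Disconnected and reconnected() on 
KeeperState.SyncConnected, and to route both the real Expired event and the 
fabricated one to the same handler. That wiring is left out here and is an 
assumption, not something tested against a cluster.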

> race between master and regionserver after missed heartbeat
> -----------------------------------------------------------
>
>                 Key: HBASE-1439
>                 URL: https://issues.apache.org/jira/browse/HBASE-1439
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.19.1
>         Environment: CentOS 5.2 x86_64, HBase 0.19.1, Hadoop 0.19.1
>            Reporter: Andrew Purtell
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.20.0
>
>
> Seen on one of our 0.19.1 clusters:
> {code}
> java.io.FileNotFoundException: File does not exist: hdfs://jdc2-atr-dc-2.atr.trendmicro.com:50000/data/hbase/log_10.3.134.207_1242286427894_60020/hlog.dat.1242528291898
>  at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:415)
>  at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:679)
>  at org.apache.hadoop.hbase.io.SequenceFile$Reader.<init>(SequenceFile.java:1431)
>  at org.apache.hadoop.hbase.io.SequenceFile$Reader.<init>(SequenceFile.java:1426)
>  at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:753)
>  at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:716)
>  at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:249)
>  at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:442)
>  at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:377)
> 2009-05-17 04:05:55,481 INFO org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown of server 10.3.134.207:60020: logSplit: false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> {code}
> I do not have the region server log yet, but here is my conjecture:
> Here, the write ahead log for 10.3.134.207 is missing in DFS: 
> java.io.FileNotFoundException: 
> hdfs://jdc2-atr-dc-2.atr.trendmicro.com:50000/data/hbase/log_10.3.134.207_1242286427894_60020/hlog.dat.1242528291898
>  when the master tries to split it after declaring the region server crashed. 
> There have been recent trouble reports on this cluster that indicate severe 
> memory stress, e.g. kernel panics due to OOM. Based on that I think it is 
> likely that the region server here missed a heartbeat so the master declared 
> it crashed and began to split the log. But, the log was then deleted out from 
> underneath the master's split thread. I think the region server was actually 
> still running but partially swapped out or the node was otherwise overloaded 
> so it missed its heartbeat. Then, when the region server came back after 
> being swapped in, it realized it missed its heartbeat and shut down, deleting 
> the log as is normally done. 
> Even if the above is not the actual cause in this case, I think the scenario 
> is plausible. What do you think?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
