[
https://issues.apache.org/jira/browse/HBASE-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854865#action_12854865
]
stack commented on HBASE-2413:
------------------------------
As to why we failed to allocate meta: looking in the log I see two restarts of 019
(2010-03-31 17:37:39,882 and 2010-03-31 17:49:43,667). After the first restart,
all looks to be restored to normal, but if you look at the emissions from the load
balancer it says:
1782 2010-03-31 17:48:46,521 INFO
org.apache.hadoop.hbase.master.ServerManager: 2 region servers, 0 dead, average
load 9.0
So we are off kilter. There should be 3 servers showing at this stage. If
you look at this message in src, you'll see that the count comes from a map keyed by
host+port. 019 was removed from the list as part of the processing of the crash.
So there will be more churn of regions than there should be.
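A minimal sketch of the accounting problem described above. The map name, host names, and port are illustrative, not the actual 0.20.x HBase code: because the server map is keyed by host+port only, the restarted 019 re-registers under the same key, and the later crash-processing for the old instance evicts the live one.

```java
import java.util.HashMap;
import java.util.Map;

public class ServerCountSketch {
    // Keyed by host+port only, so a restarted server reuses the key
    // of its dead predecessor (hypothetical stand-in for the real map).
    static Map<String, Double> serversToLoad = new HashMap<>();

    static int simulate() {
        serversToLoad.put("host018:60020", 9.0);
        serversToLoad.put("host019:60020", 9.0);
        serversToLoad.put("host020:60020", 9.0);

        // 019 restarts with a new start code; its startup report
        // overwrites the same host+port key rather than adding one.
        serversToLoad.put("host019:60020", 9.0);

        // Later, shutdown processing for the *old* 019 instance removes
        // the key, evicting the already-restarted server from the count.
        serversToLoad.remove("host019:60020");

        return serversToLoad.size();
    }

    public static void main(String[] args) {
        // Logs "2 region servers" even though 3 are alive.
        System.out.println(simulate() + " region servers, 0 dead, average load 9.0");
    }
}
```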
Next, 020 expires at 2010-03-31 17:50:57,004. It was carrying .META. It
expired 'naturally' w/ znode expiration. Its startup message comes in at 2010-03-31
17:51:15,385. We remove meta from the online regions list so we will immediately
reassign it. Meantime we are processing a close message from 018 because the
balancer is working to balance the churning cluster.
The close message can't complete because it has built into it the expectation that
meta is available.
Missing from this patch is a review of all of these message-processing items in
the master package. Need to make sure that close, open, etc., are requeued to
try again later if meta is null (as was the case here).
Need to figure out how to write a test for this stuff.
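A sketch of the requeue-on-null-meta guard suggested above. The names (todoQueue, metaAvailable, processRegionClose) are illustrative, not the actual master API: instead of dereferencing a null meta and throwing the NPE seen in the stack trace below, the item goes back on the todo queue for a later retry.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class RequeueSketch {
    // Hypothetical stand-ins for the master's todo queue and meta state.
    static Deque<String> todoQueue = new ArrayDeque<>();
    static boolean metaAvailable = false;

    // Returns true if the close was processed, false if requeued.
    static boolean processRegionClose(String regionName) {
        if (!metaAvailable) {
            // Meta not yet assigned: requeue instead of NPE-ing.
            todoQueue.addLast(regionName);
            return false;
        }
        // ... update meta with the region's new state ...
        return true;
    }

    public static void main(String[] args) {
        boolean done = processRegionClose("test1,7094000000,1270220428234");
        System.out.println(done ? "processed" : "requeued");
    }
}
```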
> Master does not respect generation stamps, may result in meta getting
> permanently offlined
> ------------------------------------------------------------------------------------------
>
> Key: HBASE-2413
> URL: https://issues.apache.org/jira/browse/HBASE-2413
> Project: Hadoop HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 0.20.3
> Reporter: Karthik Ranganathan
> Assignee: stack
> Attachments: newserver.txt
>
>
> This happens if the RS is restarted before the zk node expires. The sequence
> is as follows:
> 1. RS1 dies - let's say its server string was HOST1:PORT1:TS1
> 2. In a few seconds RS1 is restarted, it comes up as HOST1:PORT1:TS2 (TS2 is
> more recent than TS1)
> 3. Master gets a start up message from RS1 with the server name as
> HOST1:PORT1:TS2
> 4. Master adds this as a new RS, tries to red
> ---- The master does not use the generation stamps to detect that RS1 has
> already restarted.
> ---- Also, if RS1 contained meta, master would try to go to HOST1:PORT1:TS1.
> It would end up talking to HOST1:PORT1:TS2, which spews a bunch of not
> serving region exceptions.
> 5. zk node expires for HOST1:PORT1:TS1
> 6. Master tries to process shutdown for HOST1:PORT1:TS1 - this probably
> interferes with HOST1:PORT1:TS2 and ends up somehow removing the reassign-meta
> item from the master's queue.
> ---- Meta never comes online and master continues logging the following
> exception indefinitely:
> 2010-04-06 11:02:23,988 DEBUG org.apache.hadoop.hbase.master.HMaster:
> Processing todo: ProcessRegionClose of test1,7094000000,1270220428234, false,
> reassign: true
> 2010-04-06 11:02:23,988 DEBUG
> org.apache.hadoop.hbase.master.ProcessRegionClose$1: Exception in
> RetryableMetaOperation:
> java.lang.NullPointerException
> at
> org.apache.hadoop.hbase.master.RetryableMetaOperation.doWithRetries(RetryableMetaOperation.java:64)
> at
> org.apache.hadoop.hbase.master.ProcessRegionClose.process(ProcessRegionClose.java:63)
> at
> org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:494)
> at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:429)
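The generation-stamp check asked for in the description could look roughly like the sketch below. The class and method names are illustrative, not the 0.20.x HBase API; it only shows the comparison: same host and port but a newer start code means the old instance is dead and shutdown processing for it should not touch the new one.

```java
public class GenerationStampSketch {
    // A server string has the form HOST:PORT:STARTCODE.
    static boolean isRestartOf(String candidate, String previous) {
        String[] c = candidate.split(":");
        String[] p = previous.split(":");
        // Same host and port but a newer start code means the old
        // instance died and this one replaced it.
        return c[0].equals(p[0]) && c[1].equals(p[1])
            && Long.parseLong(c[2]) > Long.parseLong(p[2]);
    }

    public static void main(String[] args) {
        System.out.println(isRestartOf("HOST1:60020:200", "HOST1:60020:100"));
    }
}
```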