[
https://issues.apache.org/jira/browse/HBASE-10210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sergey Shelukhin updated HBASE-10210:
-------------------------------------
Attachment: HBASE-10210.01.patch
We discussed this with [~enis] and [~devaraj] here, and we might eventually
want a solution where we get the RS list only from ZK, or only from heartbeats,
or from both in a different order (ZK first). Meanwhile, I guess we can just do
the ts compare; I don't feel strongly either way. I kept the sync changes,
though; I am not certain why there are no other bugs without them.
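
To make the "ts compare plus sync" idea concrete, here is a minimal sketch in
Java. It is not the attached patch and not HBase's ServerManager: ServerRegistry,
checkAndRecord and Outcome are hypothetical names made up for illustration. It
assumes a server is identified by host:port plus start code, and that an equal
start code means the same instance seen through a second source (the issue
description notes that timestamps can technically collide for two separate
servers, which this sketch ignores).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the master's online-server bookkeeping. */
final class ServerRegistry {

    enum Outcome { REGISTERED, ALREADY_KNOWN, REPLACED_STALE, REJECTED_OLDER }

    // Key: "host:port"; value: start code (startup timestamp) of the recorded instance.
    private final Map<String, Long> online = new HashMap<>();

    /**
     * Intended to be called from both the ZK scan during master initialization
     * and the RPC report path. The method is synchronized so that the lookup
     * and the update happen as one atomic step across those threads.
     */
    synchronized Outcome checkAndRecord(String hostAndPort, long startCode) {
        Long existing = online.get(hostAndPort);
        if (existing == null) {
            online.put(hostAndPort, startCode);
            return Outcome.REGISTERED;
        }
        if (existing == startCode) {
            // Same host, port and start code: treat it as the same instance
            // arriving from the other source, not as a stale server.
            return Outcome.ALREADY_KNOWN;
        }
        if (existing < startCode) {
            // A newer instance on the same host:port; the old one is stale.
            online.put(hostAndPort, startCode);
            return Outcome.REPLACED_STALE;
        }
        // The report comes from an older instance than the recorded one.
        return Outcome.REJECTED_OLDER;
    }
}
```

In this sketch, registering the same host:port with start code 1387451803800
twice returns REGISTERED and then ALREADY_KNOWN, so the second arrival would not
be treated as a stale server.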
> during master startup, RS can be you-are-dead-ed by master in error
> -------------------------------------------------------------------
>
> Key: HBASE-10210
> URL: https://issues.apache.org/jira/browse/HBASE-10210
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.98.0, 0.96.1, 0.99.0, 0.96.1.1
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Attachments: HBASE-10210.01.patch, HBASE-10210.patch
>
>
> Not sure of the root cause yet; I am at the "how did this ever work" stage.
> We see this problem in 0.96.1, but didn't in 0.96.0 + some patches.
> It looks like RS information arriving from 2 sources, ZK and the server
> itself, can conflict. Master doesn't handle such cases (timestamp match), and
> anyway technically timestamps can collide for two separate servers.
> So, master YouAreDead-s the already-recorded reporting RS, and adds it too.
> Then it discovers that the new server has died with a fatal error!
> Note the threads in the log: addition is called from both master
> initialization and RPC.
> {noformat}
> 2013-12-19 11:16:45,290 INFO
> [master:h2-ubuntu12-sec-1387431063-hbase-10:60000] master.ServerManager:
> Finished waiting for region servers count to settle; checked in 2, slept for
> 18262 ms, expecting minimum of 1, maximum of 2147483647, master is running.
> 2013-12-19 11:16:45,290 INFO
> [master:h2-ubuntu12-sec-1387431063-hbase-10:60000] master.ServerManager:
> Registering
> server=h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800
> 2013-12-19 11:16:45,290 INFO
> [master:h2-ubuntu12-sec-1387431063-hbase-10:60000] master.HMaster: Registered
> server found up in zk but who has not yet reported in:
> h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800
> 2013-12-19 11:16:45,380 INFO [RpcServer.handler=4,port=60000]
> master.ServerManager: Triggering server recovery; existingServer
> h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800
> looks stale, new
> server:h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800
> 2013-12-19 11:16:45,380 INFO [RpcServer.handler=4,port=60000]
> master.ServerManager: Master doesn't enable ServerShutdownHandler during
> initialization, delay expiring server
> h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800
> ...
> 2013-12-19 11:16:46,925 ERROR [RpcServer.handler=7,port=60000]
> master.HMaster: Region server
> h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800
> reported a fatal error:
> ABORTING region server
> h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing
> h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800 as
> dead server
> {noformat}
> Presumably some of the recent ZK listener related changes b