Hi all,

I have encountered a very strange race condition during my testing which
leaves the META region inaccessible, because it was assigned to a region
server that has since shut down (after encountering a FATAL error).

Here is the scenario (using hadoop-0.20.1 and hbase-0.20.0 on a 3-node cluster):

pre condition
===============
cache01 (is the backup master; runs a region server that has the root and meta
regions assigned to it)
cache02 (runs a region server)
search01 (runs the master and the region server)

scenario
=========
kill the master on search01

the master on cache01 resumes master duties

cache01 encounters a fatal error (FATAL 
org.apache.hadoop.hbase.regionserver.LogRoller: Log rolling failed with ioe) 
and has to exit

The root is re-assigned to the region server on search01 and the meta is
re-assigned to the region server on cache02.

Now cache02 encounters the same fatal error (FATAL 
org.apache.hadoop.hbase.regionserver.LogRoller: Log rolling failed with ioe) 
and has to exit before it accepts the assignment for servicing the meta region.

post condition
===============

While the root is assigned to search01, the meta appears to have been left in
a limbo state (I think it is still in the regionsInTransitions map of the
RegionManager). I believe the cause is a race condition.
The region server on cache02 never gets the chance to complete the assignment
of the meta region. When the master on cache01 handles the death of cache02 in
ProcessServerShutdown, it never checks whether the server that died had a meta
region assigned to it in transition (the isMetaServer method in the
RegionManager checks for that). As a result, when my client connects it gets
the cache02 address for the meta server, and of course it keeps failing to
connect.

To address this race condition, I believe we simply have to check in
closeMetaRegions whether the deadServer isMetaServer and, if it is, add the
MetaRegion to the list (I had to create a new method in the RegionManager to
return the RegionInfo of the MetaRegion).
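
To make the idea concrete, here is a rough, self-contained sketch of the check
I have in mind. It is a toy model only: the class, the two maps, and the
methods offer, confirmOpen and regionsToReassign are made up for illustration
and are not the actual RegionManager or ProcessServerShutdown code. Only the
name isMetaServer is borrowed from the real class, and its body here is my
assumption of what it checks.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical names, not the real HBase master classes.
class MetaReassignSketch {
    static final String META = ".META.";

    // region name -> server it was offered to, but which has not confirmed the open yet
    final Map<String, String> inTransition = new HashMap<>();
    // region name -> server that has confirmed it is serving the region
    final Map<String, String> online = new HashMap<>();

    // The master offers a region to a server; the region is in transition until confirmed.
    void offer(String region, String server) {
        online.remove(region);
        inTransition.put(region, server);
    }

    // The server reports that it has opened the region.
    void confirmOpen(String region) {
        String server = inTransition.remove(region);
        if (server != null) {
            online.put(region, server);
        }
    }

    // True if the meta region is online on, or still in transition to, the given server.
    boolean isMetaServer(String server) {
        return server.equals(online.get(META)) || server.equals(inTransition.get(META));
    }

    // Regions to reassign when a server dies. With checkTransition == false only
    // confirmed regions are collected, which is the behaviour I suspect today.
    List<String> regionsToReassign(String deadServer, boolean checkTransition) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> e : online.entrySet()) {
            if (deadServer.equals(e.getValue())) {
                result.add(e.getKey());
            }
        }
        if (checkTransition && isMetaServer(deadServer) && !result.contains(META)) {
            inTransition.remove(META); // drop the stale offer to the dead server
            result.add(META);
        }
        return result;
    }

    public static void main(String[] args) {
        MetaReassignSketch master = new MetaReassignSketch();
        master.offer(META, "cache02"); // cache02 dies before confirming the open
        System.out.println("without check: " + master.regionsToReassign("cache02", false));
        System.out.println("with check:    " + master.regionsToReassign("cache02", true));
    }
}
```

In this toy model the first call returns an empty list, so the meta region
stays stuck pointing at cache02, while the second call returns the meta region
so that it can be handed to another server.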

I have not been able to verify my fix, though, since I have been unable to
reproduce the above scenario.

Let me know what you guys think. I have attached links to the logs at the end.

Also, I would appreciate it if you could tell me what might have caused the
fatal error on the region servers (I suspect it is related to my killing the
master nodes).

Thanks in advance,

=======
master logs on cache01: http://pastebin.com/m61f4893d
regionserver logs on cache01: http://pastebin.com/m56e4302b
regionserver logs on cache02: http://pastebin.com/m11fac0e6
regionserver logs on search01: http://pastebin.com/d667f876c
(For the FATAL errors)
namenode on cache01: http://pastebin.com/dc020387
datanode on cache01: http://pastebin.com/ma25decd

Yannis.

--
Search for the Pulse

Yannis Pavlidis | OneRiot
Softwarist
talk: 720.771.7025
write: [email protected]
web: www.oneriot.com

