Master failover can split logs of live servers
----------------------------------------------
Key: HBASE-3380
URL: https://issues.apache.org/jira/browse/HBASE-3380
Project: HBase
Issue Type: Bug
Reporter: Jean-Daniel Cryans
Priority: Blocker
Fix For: 0.90.0
TestMasterFailover fails because, during the master failover, the new master
doesn't wait long enough for all region servers to check in, so it goes ahead
and splits the logs of servers that are still alive. That doesn't work because
of the way lease timeouts work; the region server still holds the lease on its
log file:
{noformat}
2010-12-21 07:30:36,977 DEBUG [Master:0;vesta.apache.org:33170]
wal.HLogSplitter(256): Splitting hlog 1 of 1:
hdfs://localhost:49187/user/hudson/.logs/vesta.apache.org,38743,1292916616340/vesta.apache.org%3A38743.1292916617204,
length=0
2010-12-21 07:30:36,977 DEBUG [WriterThread-1]
wal.HLogSplitter$WriterThread(619): Writer thread
Thread[WriterThread-1,5,main]: starting
2010-12-21 07:30:36,977 DEBUG [WriterThread-2]
wal.HLogSplitter$WriterThread(619): Writer thread
Thread[WriterThread-2,5,main]: starting
2010-12-21 07:30:36,977 INFO [Master:0;vesta.apache.org:33170]
util.FSUtils(625): Recovering file
hdfs://localhost:49187/user/hudson/.logs/vesta.apache.org,38743,1292916616340/vesta.apache.org%3A38743.1292916617204
2010-12-21 07:30:36,979 WARN [IPC Server handler 8 on 49187]
namenode.FSNamesystem(1122): DIR* NameSystem.startFile:
failed to create file
/user/hudson/.logs/vesta.apache.org,38743,1292916616340/vesta.apache.org%3A38743.1292916617204
for
DFSClient_hb_m_vesta.apache.org:33170_1292916630791 on client 127.0.0.1,
because this file is already being created by
DFSClient_hb_rs_vesta.apache.org,38743,1292916616340_1292916617166 on 127.0.0.1
...
2010-12-21 07:33:44,332 WARN [Master:0;vesta.apache.org:33170]
util.FSUtils(644): Waited 187354ms for lease recovery on
hdfs://localhost:49187/user/hudson/.logs/vesta.apache.org,38743,1292916616340/vesta.apache.org%3A38743.1292916617204:
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create
file
/user/hudson/.logs/vesta.apache.org,38743,1292916616340/vesta.apache.org%3A38743.1292916617204
for DFSClient_hb_m_vesta.apache.org:33170_1292916630791 on client 127.0.0.1,
because this file is already
being created by
DFSClient_hb_rs_vesta.apache.org,38743,1292916616340_1292916617166 on 127.0.0.1
{noformat}
I think we should always check ZK for the number of live region servers
before waiting for them to check in; that way we know how many to expect
during failover. We still want to time out in some cases, since an RS can die
during that time, but we should wait a bit longer.
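
A rough sketch of the wait I have in mind, not the actual master code. The class and method names are hypothetical, and the expected count is passed in as a plain int; it is assumed to have been read from ZK beforehand (for example by counting the znodes that live region servers register). The loop returns as soon as the expected number of servers has checked in, and otherwise gives up at the timeout so that an RS dying in the meantime cannot block the master forever.

```java
import java.util.function.IntSupplier;

/** Hypothetical helper for illustration only; not part of HBase. */
final class RegionServerCheckinWait {

    private RegionServerCheckinWait() {}

    /**
     * Polls until at least {@code expected} servers have checked in or
     * {@code timeoutMs} has elapsed, whichever comes first.
     *
     * @param expected  number of live servers (assumed to come from ZK)
     * @param checkedIn supplies the number of servers checked in so far
     * @param timeoutMs upper bound on the wait, in milliseconds
     * @param pollMs    sleep between polls, in milliseconds
     * @return the number of servers checked in when the wait ended
     */
    static int waitForCheckins(int expected, IntSupplier checkedIn,
                               long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        int seen = checkedIn.getAsInt();
        // Subtracting before comparing keeps the check safe if nanoTime wraps.
        while (seen < expected && deadline - System.nanoTime() > 0) {
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and stop waiting.
                Thread.currentThread().interrupt();
                break;
            }
            seen = checkedIn.getAsInt();
        }
        return seen;
    }
}
```

If the returned count is lower than expected, the caller knows some servers never checked in and can treat only those as candidates for log splitting, instead of splitting everything it has not heard from yet.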