[ https://issues.apache.org/jira/browse/HBASE-3380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13129339#comment-13129339 ]
Jonathan Gray commented on HBASE-3380:
--------------------------------------

What's the best practice here? Should I just commit this to 92 and trunk and make a note here? Should I open a new jira since this is so old?

(Thanks for input guys)

> Master failover can split logs of live servers
> ----------------------------------------------
>
>                 Key: HBASE-3380
>                 URL: https://issues.apache.org/jira/browse/HBASE-3380
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jonathan Gray
>            Priority: Blocker
>             Fix For: 0.90.0
>
>         Attachments: HBASE-3380-v1.patch, HBASE-3380-v2.patch
>
>
> The reason why TestMasterFailover fails is that when it does the master
> failover, the new master doesn't wait long enough for all region servers to
> check in, so it goes ahead and splits logs... which doesn't work because of
> the way lease timeouts work:
> {noformat}
> 2010-12-21 07:30:36,977 DEBUG [Master:0;vesta.apache.org:33170] wal.HLogSplitter(256): Splitting hlog 1 of 1: hdfs://localhost:49187/user/hudson/.logs/vesta.apache.org,38743,1292916616340/vesta.apache.org%3A38743.1292916617204, length=0
> 2010-12-21 07:30:36,977 DEBUG [WriterThread-1] wal.HLogSplitter$WriterThread(619): Writer thread Thread[WriterThread-1,5,main]: starting
> 2010-12-21 07:30:36,977 DEBUG [WriterThread-2] wal.HLogSplitter$WriterThread(619): Writer thread Thread[WriterThread-2,5,main]: starting
> 2010-12-21 07:30:36,977 INFO [Master:0;vesta.apache.org:33170] util.FSUtils(625): Recovering file hdfs://localhost:49187/user/hudson/.logs/vesta.apache.org,38743,1292916616340/vesta.apache.org%3A38743.1292916617204
> 2010-12-21 07:30:36,979 WARN [IPC Server handler 8 on 49187] namenode.FSNamesystem(1122): DIR* NameSystem.startFile: failed to create file /user/hudson/.logs/vesta.apache.org,38743,1292916616340/vesta.apache.org%3A38743.1292916617204 for DFSClient_hb_m_vesta.apache.org:33170_1292916630791 on client 127.0.0.1, because this file is already being created by DFSClient_hb_rs_vesta.apache.org,38743,1292916616340_1292916617166 on 127.0.0.1
> ...
> 2010-12-21 07:33:44,332 WARN [Master:0;vesta.apache.org:33170] util.FSUtils(644): Waited 187354ms for lease recovery on hdfs://localhost:49187/user/hudson/.logs/vesta.apache.org,38743,1292916616340/vesta.apache.org%3A38743.1292916617204: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /user/hudson/.logs/vesta.apache.org,38743,1292916616340/vesta.apache.org%3A38743.1292916617204 for DFSClient_hb_m_vesta.apache.org:33170_1292916630791 on client 127.0.0.1, because this file is already being created by DFSClient_hb_rs_vesta.apache.org,38743,1292916616340_1292916617166 on 127.0.0.1
> {noformat}
> I think that we should always check in ZK the number of live region servers
> before waiting for them to check in; this way we know how many we should
> expect during failover. There's also a case where we still want to time out,
> since an RS can die during that time, but we should wait a bit longer.
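
For illustration only, here is a minimal sketch of the wait described in the issue: read the expected number of live region servers (from ZK, per the description), then poll until that many have checked in or a timeout expires. This is not the attached patch and uses no HBase or ZooKeeper APIs. The class FailoverWait, the method waitForCheckIn, and both suppliers are hypothetical names made up for this example.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

/** Illustrative sketch only; none of these names exist in HBase. */
final class FailoverWait {

    private FailoverWait() {}

    /**
     * Polls until at least as many region servers have checked in as are
     * reported live, or until timeoutMs elapses.
     *
     * @param liveInZk  hypothetical source of the live-server count (e.g. ZK)
     * @param checkedIn hypothetical source of the checked-in count
     * @return the number of servers checked in when the wait ended
     */
    static int waitForCheckIn(IntSupplier liveInZk, IntSupplier checkedIn,
                              long timeoutMs, long pollMs) {
        final long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            // Re-read the expected count on every pass: a server can die
            // while we wait, which lowers the number we should expect.
            int expected = liveInZk.getAsInt();
            int seen = checkedIn.getAsInt();
            if (seen >= expected) {
                return seen; // everyone known to be alive has checked in
            }
            if (System.nanoTime() - deadline >= 0) {
                return seen; // timed out; the caller decides what to do next
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return checkedIn.getAsInt();
            }
        }
    }

    public static void main(String[] args) {
        // Simulated cluster: two live servers, one more checks in per poll.
        AtomicInteger checkedIn = new AtomicInteger();
        int seen = waitForCheckIn(() -> 2, checkedIn::incrementAndGet, 5_000L, 10L);
        System.out.println("checked in: " + seen);
    }
}
```

In this sketch a caller would start log splitting only after the method returns, and would treat a return value below the live count as a timeout rather than as proof that the missing servers are dead.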