I'm +1 in favor of sinking it, for the reason that if someone enables this feature and runs into the issue, it will stop the regionserver.
Lars: in case you need any help for the rc2 candidate, please pull me in. I'd love to do so.

Thanks,
Himanshu

On Wed, Mar 13, 2013 at 8:06 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> This was the JIRA that introduced copyQueuesFromRSUsingMulti():
> HBASE-2611 Handle RS that fails while processing the failure of another one
> (Himanshu Vashishtha)
>
> It went into 0.94.5
> And the feature is off by default:
>
>   <name>hbase.zookeeper.useMulti</name>
>   <value>false</value>
>
> The fact that Lars first reported the following problem meant that no other
> user tried this feature.
>
> Hence I think 0.94.6 RC1 doesn't need to be sunk.
>
> Cheers
>
> On Wed, Mar 13, 2013 at 6:45 PM, lars hofhansl <la...@apache.org> wrote:
>>
>> Hey no problem. It's cool that we found it in a test env. It's probably
>> quite hard to reproduce.
>> This is in 0.94.5 but this feature is off by default.
>>
>> What's the general thought here, should I kill the current 0.94.6 RC for
>> this?
>> My gut says: Yes.
>>
>> I'm also a bit worried about these:
>>
>> 2013-03-14 01:42:42,271 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>>   Opening log for replication shared-dnds1-12-sfm.ops.sfdc.net%2C60020%2C1363220608780.1363220609572 at 0
>> 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got:
>> java.io.EOFException
>>     at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>     at java.io.DataInputStream.readFully(DataInputStream.java:152)
>>     at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1800)
>>     at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
>>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
>>     at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:728)
>>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
>>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:507)
>>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:313)
>> 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>>   Waited too long for this file, considering dumping
>> 2013-03-14 01:42:42,358 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>>   Unable to open a reader, sleeping 1000 times 10
>>
>> This happens after bouncing the cluster a 2nd time and these messages
>> repeat every 10s (for hours now). This is a separate problem I think.
>>
>> -- Lars
>>
>> ________________________________
>> From: Himanshu Vashishtha <hvash...@cs.ualberta.ca>
>> To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
>> Cc: Ted Yu <yuzhih...@gmail.com>
>> Sent: Wednesday, March 13, 2013 6:38 PM
>> Subject: Re: Replication hosed after simple cluster restart
>>
>> This is bad. Yes, copyQueuesFromRSUsingMulti returns a list which it
>> might not be able to move later on, resulting in bogus znodes.
>> I'll fix this asap. Weird it didn't happen in my testing earlier.
>> Sorry about this.
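For context, the failure mode described above, claiming a dead regionserver's replication queues with a ZooKeeper multi() yet reporting queues that were never actually moved, roughly corresponds to the pattern sketched below. This is an illustrative sketch only, not the HBase implementation: the znode paths, the leaf-only queue layout, and the claimQueues() helper are all hypothetical, and the real replication znodes have per-WAL children that are omitted here.

import java.util.ArrayList;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ClaimQueuesSketch {

  // Claim every replication queue of a dead regionserver in one atomic
  // multi(): re-create each queue znode under our own znode, delete the
  // original, and finally delete the dead RS's parent znode. Because multi()
  // is all-or-nothing, the returned list only ever names queues whose znodes
  // really were moved, which avoids handing bogus znodes to the caller.
  static List<String> claimQueues(ZooKeeper zk, String deadRsZnode, String myRsZnode)
      throws KeeperException, InterruptedException {
    List<String> queues = zk.getChildren(deadRsZnode, false);
    List<Op> ops = new ArrayList<Op>();
    for (String q : queues) {
      byte[] data = zk.getData(deadRsZnode + "/" + q, false, null);
      // Simplification: queues are treated as leaf znodes here; the real
      // layout has one child znode per WAL, which would also need moving.
      ops.add(Op.create(myRsZnode + "/" + q + "-claimed", data,
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT));
      ops.add(Op.delete(deadRsZnode + "/" + q, -1));
    }
    ops.add(Op.delete(deadRsZnode, -1));
    try {
      zk.multi(ops);
    } catch (KeeperException e) {
      // The transaction failed, so nothing was moved; report nothing claimed.
      return new ArrayList<String>();
    }
    return queues;
  }
}

The design point is simply that whatever list the method returns must be derived from the outcome of the multi() call itself, not computed beforehand and returned regardless of whether the move succeeded.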
>> On Wed, Mar 13, 2013 at 6:27 PM, lars hofhansl <la...@apache.org> wrote:
>> > Sorry 0.94.6RC1
>> > (I complain about folks not reporting the version all the time, and then
>> > I do it too)
>> >
>> > ________________________________
>> > From: Ted Yu <yuzhih...@gmail.com>
>> > To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
>> > Sent: Wednesday, March 13, 2013 6:17 PM
>> > Subject: Re: Replication hosed after simple cluster restart
>> >
>> > Did this happen on 0.94.5 ?
>> >
>> > Thanks
>> >
>> > On Wed, Mar 13, 2013 at 6:12 PM, lars hofhansl <la...@apache.org> wrote:
>> >
>> >> We just ran into an interesting scenario. We restarted a cluster that
>> >> was setup as a replication source.
>> >> The stop went cleanly.
>> >>
>> >> Upon restart *all* regionservers aborted within a few seconds with
>> >> variations of these errors:
>> >> http://pastebin.com/3iQVuBqS
>> >>
>> >> This is scary!
>> >>
>> >> -- Lars
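As for the EOFException loop in Lars's log further up the thread (the ReplicationSource retrying "Unable to open a reader" every 10 seconds), one plausible guard is sketched below, under the assumption that the offending WAL is simply empty or has an incomplete SequenceFile header after the restart: check the file length before handing it to SequenceFile.Reader and let the caller skip or defer it. This is a hypothetical sketch, not the actual ReplicationSource code; tryOpen() and its behavior are illustrative only.

import java.io.EOFException;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class WalOpenSketch {

  // Try to open a WAL for replication. Returns null (instead of looping on
  // EOFException forever) when the file is zero-length or its SequenceFile
  // header has not been flushed yet, so the caller can skip it or retry later.
  static SequenceFile.Reader tryOpen(Configuration conf, Path wal) throws IOException {
    FileSystem fs = wal.getFileSystem(conf);
    if (fs.getFileStatus(wal).getLen() == 0) {
      return null;  // freshly rolled, never-written log: nothing to ship yet
    }
    try {
      return new SequenceFile.Reader(fs, wal, conf);
    } catch (EOFException e) {
      return null;  // header not complete yet; treat like an empty log
    }
  }
}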