I'm +1 in favor of sinking it, for the reason that if someone enables this feature and runs into the issue, it will stop the regionserver.
Lars: in case you need any help for the rc2 candidate, please pull me in. I'd love to do so.

Thanks,
Himanshu

On Wed, Mar 13, 2013 at 8:06 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> This was the JIRA that introduced copyQueuesFromRSUsingMulti():
> HBASE-2611 Handle RS that fails while processing the failure of another one
> (Himanshu Vashishtha)
>
> It went into 0.94.5
> And the feature is off by default:
>
>   <name>hbase.zookeeper.useMulti</name>
>   <value>false</value>
>
> The fact that Lars first reported the following problem meant that no other
> user tried this feature.
>
> Hence I think 0.94.6 RC1 doesn't need to be sunk.
>
> Cheers
>
> On Wed, Mar 13, 2013 at 6:45 PM, lars hofhansl <la...@apache.org> wrote:
>>
>> Hey no problem. It's cool that we found it in a test env. It's probably
>> quite hard to reproduce.
>> This is in 0.94.5 but this feature is off by default.
>>
>> What's the general thought here, should I kill the current 0.94.6 RC for
>> this?
>> My gut says: Yes.
>>
>> I'm also a bit worried about these:
>>
>> 2013-03-14 01:42:42,271 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>>   Opening log for replication shared-dnds1-12-sfm.ops.sfdc.net%2C60020%2C1363220608780.1363220609572 at 0
>> 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got:
>> java.io.EOFException
>>     at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>     at java.io.DataInputStream.readFully(DataInputStream.java:152)
>>     at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1800)
>>     at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
>>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
>>     at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:728)
>>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
>>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:507)
>>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:313)
>> 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>>   Waited too long for this file, considering dumping
>> 2013-03-14 01:42:42,358 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>>   Unable to open a reader, sleeping 1000 times 10
>>
>> This happens after bouncing the cluster a 2nd time and these messages
>> repeat every 10s (for hours now). This is a separate problem I think.
>>
>> -- Lars
>>
>> ________________________________
>> From: Himanshu Vashishtha <hvash...@cs.ualberta.ca>
>> To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
>> Cc: Ted Yu <yuzhih...@gmail.com>
>> Sent: Wednesday, March 13, 2013 6:38 PM
>> Subject: Re: Replication hosed after simple cluster restart
>>
>> This is bad. Yes, copyQueuesFromRSUsingMulti returns a list which it
>> might not be able to move later on, resulting in bogus znodes.
>> I'll fix this asap. Weird it didn't happen in my testing earlier.
>> Sorry about this.
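For context, the failure mode described above, claiming a dead regionserver's replication queues with a ZooKeeper multi() yet reporting queues that were never actually moved, roughly corresponds to the pattern sketched below. This is an illustrative sketch only, not the HBase implementation: the znode paths, the leaf-only queue layout, and the claimQueues() helper are all hypothetical, and the real replication znodes have per-WAL children that are omitted here.

import java.util.ArrayList;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ClaimQueuesSketch {

  // Claim every replication queue of a dead regionserver in one atomic
  // multi(): re-create each queue znode under our own znode, delete the
  // original, and finally delete the dead RS's parent znode. Because multi()
  // is all-or-nothing, the returned list only ever names queues whose znodes
  // really were moved, which avoids handing bogus znodes to the caller.
  static List<String> claimQueues(ZooKeeper zk, String deadRsZnode, String myRsZnode)
      throws KeeperException, InterruptedException {
    List<String> queues = zk.getChildren(deadRsZnode, false);
    List<Op> ops = new ArrayList<Op>();
    for (String q : queues) {
      byte[] data = zk.getData(deadRsZnode + "/" + q, false, null);
      // Simplification: queues are treated as leaf znodes here; the real
      // layout has one child znode per WAL, which would also need moving.
      ops.add(Op.create(myRsZnode + "/" + q + "-claimed", data,
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT));
      ops.add(Op.delete(deadRsZnode + "/" + q, -1));
    }
    ops.add(Op.delete(deadRsZnode, -1));
    try {
      zk.multi(ops);
    } catch (KeeperException e) {
      // The transaction failed, so nothing was moved; report nothing claimed.
      return new ArrayList<String>();
    }
    return queues;
  }
}

The design point is simply that whatever list the method returns must be derived from the outcome of the multi() call itself, not computed beforehand and returned regardless of whether the move succeeded.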
>> On Wed, Mar 13, 2013 at 6:27 PM, lars hofhansl <la...@apache.org> wrote:
>> > Sorry 0.94.6RC1
>> > (I complain about folks not reporting the version all the time, and then
>> > I do it too)
>> >
>> > ________________________________
>> > From: Ted Yu <yuzhih...@gmail.com>
>> > To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
>> > Sent: Wednesday, March 13, 2013 6:17 PM
>> > Subject: Re: Replication hosed after simple cluster restart
>> >
>> > Did this happen on 0.94.5 ?
>> >
>> > Thanks
>> >
>> > On Wed, Mar 13, 2013 at 6:12 PM, lars hofhansl <la...@apache.org> wrote:
>> >
>> >> We just ran into an interesting scenario. We restarted a cluster that
>> >> was setup as a replication source.
>> >> The stop went cleanly.
>> >>
>> >> Upon restart *all* regionservers aborted within a few seconds with
>> >> variations of these errors:
>> >> http://pastebin.com/3iQVuBqS
>> >>
>> >> This is scary!
>> >>
>> >> -- Lars
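As for the EOFException loop in Lars's log further up the thread (the ReplicationSource retrying "Unable to open a reader" every 10 seconds), one plausible guard is sketched below, under the assumption that the offending WAL is simply empty or has an incomplete SequenceFile header after the restart: check the file length before handing it to SequenceFile.Reader and let the caller skip or defer it. This is a hypothetical sketch, not the actual ReplicationSource code; tryOpen() and its behavior are illustrative only.

import java.io.EOFException;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class WalOpenSketch {

  // Try to open a WAL for replication. Returns null (instead of looping on
  // EOFException forever) when the file is zero-length or its SequenceFile
  // header has not been flushed yet, so the caller can skip it or retry later.
  static SequenceFile.Reader tryOpen(Configuration conf, Path wal) throws IOException {
    FileSystem fs = wal.getFileSystem(conf);
    if (fs.getFileStatus(wal).getLen() == 0) {
      return null;  // freshly rolled, never-written log: nothing to ship yet
    }
    try {
      return new SequenceFile.Reader(fs, wal, conf);
    } catch (EOFException e) {
      return null;  // header not complete yet; treat like an empty log
    }
  }
}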