Thanks Stack. I sent the logs. Also, I have since bounced HDFS and ZK and the problem is gone now (I can start RSs again and they stay up). Something got into a weird state.
-- Lars

________________________________
From: Stack <st...@duboce.net>
To: HBase Dev List <dev@hbase.apache.org>; lars hofhansl <la...@apache.org>
Sent: Thursday, May 9, 2013 10:34 AM
Subject: Re: All region server died due to "Parent directory doesn't exist"

Want to send me a regionserver log Lars? (off-list)
St.Ack

On Thu, May 9, 2013 at 10:03 AM, lars hofhansl <la...@apache.org> wrote:

>Thanks Ted and Varun.
>
>Let me check on the .META. server.
>
>The majority (13) of the RSs died within 2 minutes. The remaining 3 died over
>the following 10 minutes.
>So that would point to a general issue. I did not see any ZK issues but I'll
>double check.
>
>It is just interesting that even now, if I start an RS it aborts within a
>minute or two, because of this issue.
>
>-- Lars
>
>----- Original Message -----
>From: Ted Yu <yuzhih...@gmail.com>
>To: dev@hbase.apache.org
>Cc:
>Sent: Thursday, May 9, 2013 9:51 AM
>Subject: Re: All region server died due to "Parent directory doesn't exist"
>
>Thanks Varun for sharing your experience.
>
>Lars:
>Was the server carrying .META. functioning properly around the time when
>you observed the problem?
>
>Cheers
>
>On Thu, May 9, 2013 at 9:41 AM, Varun Sharma <va...@pinterest.com> wrote:
>
>> I meant no NTP/clock synchronization b/w zookeeper quorum and the HBase
>> cluster. I am not sure if you are seeing the exact same issue though. We
>> did not have mass failures at the same time due to this.
>>
>> Thanks
>> Varun
>>
>> On Thu, May 9, 2013 at 9:39 AM, Varun Sharma <va...@pinterest.com> wrote:
>>
>> > Btw, I am not 100% sure but I have seen something like this before:
>> >
>> > 1) ZK connection flakiness causes ephemeral nodes to expire
>> > 2) Master detects failure and renames the logs into a splitting directory
>> > - this is intentional so that in case that region server comes back up, it
>> > cannot write to the logs being split
>> > 3) Region server dies because the log is renamed
>> >
>> > So, the yanking away of files is done by the HBase master and is expected
>> > if the master feels the server is dead. We found that the Region server
>> > logs DFS exceptions like crazy (1000s of them) in that case and we always
>> > suspected that this is some kind of DFS error but when we really went up to
>> > the point where it started, we found some zookeeper session issues.
>> >
>> > We had two cases of this - either super high load or NTP/no clock
>> > synchronization b/w the clusters causing this issue for us.
>> >
>> > Thanks
>> > Varun
>> >
>> > On Thu, May 9, 2013 at 9:16 AM, lars hofhansl <la...@apache.org> wrote:
>> >
>> >> Thanks Ted. I'll do the same.
>> >>
>> >> ----- Original Message -----
>> >> From: Ted Yu <yuzhih...@gmail.com>
>> >> To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
>> >> Cc:
>> >> Sent: Thursday, May 9, 2013 9:07 AM
>> >> Subject: Re: All region server died due to "Parent directory doesn't exist"
>> >>
>> >> I went through the patch for HBASE-7824 one more time and didn't find
>> >> a direct correlation to the issue Lars reported.
>> >>
>> >> I am going over the other JIRAs in Lars' list.
>> >>
>> >> Cheers
>> >>
>> >> On Thu, May 9, 2013 at 8:48 AM, lars hofhansl <la...@apache.org> wrote:
>> >>
>> >> > I will try. I do not think this is the issue, though.
>> >> >
>> >> > The master is up in my case.
>> >> > Right now the cluster is in a state where each region server aborts itself
>> >> > shortly after being started (which coincides with having its log directory
>> >> > renamed to ...-splitting).
>> >> >
>> >> > This is a test cluster and I could just start from scratch... This appears
>> >> > to be a serious enough problem, though, and I would like to track down the
>> >> > issue.
>> >> >
>> >> > -- Lars
>> >> >
>> >> > ----- Original Message -----
>> >> > From: Ted Yu <yuzhih...@gmail.com>
>> >> > To: "dev@hbase.apache.org" <dev@hbase.apache.org>
>> >> > Cc: "dev@hbase.apache.org" <dev@hbase.apache.org>
>> >> > Sent: Thursday, May 9, 2013 2:04 AM
>> >> > Subject: Re: All region server died due to "Parent directory doesn't exist"
>> >> >
>> >> > The config came from HBASE-7824.
>> >> >
>> >> > There are other JIRAs in Lars' list which are related to log splitting.
>> >> >
>> >> > I think more investigation is needed.
>> >> >
>> >> > Cheers
>> >> >
>> >> > On May 9, 2013, at 1:59 AM, Andrew Purtell <apurt...@apache.org> wrote:
>> >> >
>> >> > > So that is HBASE-7824, right?
>> >> > >
>> >> > > On Thu, May 9, 2013 at 4:33 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>> >> > >
>> >> > >> hbase.master.wait.for.log.splitting
>> >> > >
>> >> > > --
>> >> > > Best regards,
>> >> > >
>> >> > > - Andy
>> >> > >
>> >> > > Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> >> > > (via Tom White)
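
For reference, the sequence Varun describes in the quoted thread (ZK session expiry, the master renaming the log directory into a -splitting directory, then the region server dying) can be illustrated with a minimal local-filesystem sketch. This is an analogy only, not HBase or HDFS code: the function simulate_log_fencing and the names rs1-logs, log.0 and log.1 are hypothetical, and the HDFS leases and ZK sessions a real cluster involves are not modelled here.

```python
import os
import tempfile


def simulate_log_fencing():
    """Local-filesystem analogy of rename-based log fencing (hypothetical names)."""
    events = []
    with tempfile.TemporaryDirectory() as root:
        # Assumed layout for illustration: one log directory per region server.
        log_dir = os.path.join(root, "rs1-logs")
        os.mkdir(log_dir)

        # The region server writes its first log file normally.
        with open(os.path.join(log_dir, "log.0"), "w") as f:
            f.write("edit-1\n")
        events.append("rs wrote log.0")

        # The master believes the server is dead (for example, its ZK
        # session expired) and renames the directory before splitting.
        splitting_dir = log_dir + "-splitting"
        os.rename(log_dir, splitting_dir)
        events.append("master renamed dir to -splitting")

        # The still-running region server tries to create a new log file
        # under the old path; the parent directory no longer exists.
        try:
            with open(os.path.join(log_dir, "log.1"), "w") as f:
                f.write("edit-2\n")
            events.append("rs wrote log.1")
        except FileNotFoundError:
            events.append("rs aborts: parent directory does not exist")

        # The log written before the rename is still there for splitting.
        preserved = sorted(os.listdir(splitting_dir))
    return events, preserved


if __name__ == "__main__":
    events, preserved = simulate_log_fencing()
    for event in events:
        print(event)
    print("preserved for splitting:", preserved)
```

The point of the sketch is that the rename is deliberate fencing: a server the master considers dead can no longer add files under its old log path, which is consistent with a restarted RS aborting once its directory has been renamed.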