Re: All region server died due to "Parent directory doesn't exist"

Stack Thu, 09 May 2013 10:35:15 -0700

Want to send me a regionserver log, Lars? (off-list)
St.Ack


On Thu, May 9, 2013 at 10:03 AM, lars hofhansl <la...@apache.org> wrote:

> Thanks Ted and Varun.
>
>
> Let me check on the .META. server.
>
>
> The majority (13) of the RSs died within 2 minutes. The remaining 3 died
> over the following 10 minutes.
> So that would point to a general issue. I did not see any ZK issues but I'll
> double check.
>
>
> It is just interesting that even now, if I start an RS it aborts within a
> minute or two, because of this issue.
>
> -- Lars
>
>
> ----- Original Message -----
> From: Ted Yu <yuzhih...@gmail.com>
> To: dev@hbase.apache.org
> Cc:
> Sent: Thursday, May 9, 2013 9:51 AM
> Subject: Re: All region server died due to "Parent directory doesn't exist"
>
> Thanks Varun for sharing your experience.
>
> Lars:
> Was the server carrying .META. functioning properly around the time when
> you observed the problem?
>
> Cheers
>
> On Thu, May 9, 2013 at 9:41 AM, Varun Sharma <va...@pinterest.com> wrote:
>
> > I meant no NTP/clock synchronization b/w zookeeper quorum and the HBase
> > cluster. I am not sure if you are seeing the exact same issue though. We
> > did not have mass failures at the same time due to this.
> >
> > Thanks
> > Varun
> >
> >
> > On Thu, May 9, 2013 at 9:39 AM, Varun Sharma <va...@pinterest.com> wrote:
> >
> > > Btw, I am not 100% sure, but I have seen something like this before:
> > >
> > > 1) ZK connection flakiness causes ephemeral nodes to expire
> > > 2) Master detects failure and renames the logs into a splitting directory
> > > - this is intentional so that in case that region server comes back up, it
> > > cannot write to the logs being split
> > > 3) Region server dies because the log is renamed
> > >
> > > So, the yanking away of files is done by the HBase master and is expected
> > > if the master feels the server is dead. We found that the region server
> > > logs DFS exceptions like crazy (1000s of them) in that case, and we always
> > > suspected some kind of DFS error, but when we went back to the point where
> > > it started, we found ZooKeeper session issues.
> > >
> > > We had two cases of this - either super high load or NTP/no clock
> > > synchronization b/w the clusters causing this issue for us.
> > >
> > > Thanks
> > > Varun
> > >
> > >
> > > On Thu, May 9, 2013 at 9:16 AM, lars hofhansl <la...@apache.org> wrote:
> > >
> > >> Thanks Ted. I'll do the same.
> > >>
> > >>
> > >> ----- Original Message -----
> > >> From: Ted Yu <yuzhih...@gmail.com>
> > >> To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
> > >> Cc:
> > >> Sent: Thursday, May 9, 2013 9:07 AM
> > >> Subject: Re: All region server died due to "Parent directory doesn't
> > >> exist"
> > >>
> > >> I went through the patch for HBASE-7824 one more time and didn't find a
> > >> direct correlation to the issue Lars reported.
> > >>
> > >> I am going over the other JIRAs in Lars' list.
> > >>
> > >> Cheers
> > >>
> > >> On Thu, May 9, 2013 at 8:48 AM, lars hofhansl <la...@apache.org> wrote:
> > >>
> > >> > I will try. I do not think this is the issue, though.
> > >> >
> > >> > The master is up in my case.
> > >> > Right now the cluster is in a state where each region server aborts
> > >> > itself shortly after being started (which coincides with having its log
> > >> > directory renamed to ...-splitting).
> > >> >
> > >> >
> > >> > This is a test cluster and I could just start from scratch... This
> > >> > appears to be a serious enough problem, though, and I would like to
> > >> > track down the issue.
> > >> >
> > >> > -- Lars
> > >> >
> > >> >
> > >> >
> > >> > ----- Original Message -----
> > >> > From: Ted Yu <yuzhih...@gmail.com>
> > >> > To: "dev@hbase.apache.org" <dev@hbase.apache.org>
> > >> > Cc: "dev@hbase.apache.org" <dev@hbase.apache.org>
> > >> > Sent: Thursday, May 9, 2013 2:04 AM
> > >> > Subject: Re: All region server died due to "Parent directory doesn't exist"
> > >> >
> > >> > The config came from HBASE-7824.
> > >> >
> > >> > There are other JIRAs in Lars' list which are related to log splitting.
> > >> >
> > >> > I think more investigation is needed.
> > >> >
> > >> > Cheers
> > >> >
> > >> > On May 9, 2013, at 1:59 AM, Andrew Purtell <apurt...@apache.org> wrote:
> > >> >
> > >> > > So that is HBASE-7824, right?
> > >> > >
> > >> > > On Thu, May 9, 2013 at 4:33 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > >> > >
> > >> > >> hbase.master.wait.for.log.splitting
> > >> > >
> > >> > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > > Best regards,
> > >> > >
> > >> > >   - Andy
> > >> > >
> > >> > > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > >> > > (via Tom White)
> > >> >
> > >> >
> > >>
> > >>
> > >
> >
>
>
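
The three-step sequence Varun describes in the quoted thread (ZK session expiry, the master renaming the log directory to ...-splitting, the region server then failing on its next log write) can be illustrated with a toy model. This is only a sketch under stated assumptions, not HBase code: ToyFS, master_fence, region_server_roll_log and the path /hbase/.logs/rs1 are all hypothetical names made up for illustration, and the in-memory dictionary merely stands in for a DFS namespace.

```python
class ToyFS:
    """Hypothetical in-memory stand-in for a DFS namespace (dir path -> file names)."""

    def __init__(self):
        self.dirs = {}

    def mkdir(self, path):
        self.dirs.setdefault(path, set())

    def rename(self, src, dst):
        # Moving the directory makes the old path disappear for every writer.
        self.dirs[dst] = self.dirs.pop(src)

    def create(self, directory, name):
        if directory not in self.dirs:
            raise FileNotFoundError(f"Parent directory doesn't exist: {directory}")
        self.dirs[directory].add(name)


def master_fence(fs, log_dir):
    """Step 2: the master renames the 'dead' server's log dir before splitting."""
    splitting_dir = log_dir + "-splitting"
    fs.rename(log_dir, splitting_dir)
    return splitting_dir


def region_server_roll_log(fs, log_dir, seq):
    """Step 3: True if a new log file could be created, False if the server would abort."""
    try:
        fs.create(log_dir, f"wal.{seq}")
        return True
    except FileNotFoundError:
        return False


def simulate():
    fs = ToyFS()
    log_dir = "/hbase/.logs/rs1"  # illustrative path only
    fs.mkdir(log_dir)
    ok_before = region_server_roll_log(fs, log_dir, 1)
    # Step 1 (assumed): the ZK session expires, so the master treats rs1 as dead.
    splitting_dir = master_fence(fs, log_dir)
    ok_after = region_server_roll_log(fs, log_dir, 2)
    return ok_before, ok_after, splitting_dir


if __name__ == "__main__":
    print(simulate())
```

In this model the write before the rename succeeds and the write after it fails, which is the fencing effect described in the thread: the rename is deliberate, and the server-side error is a symptom rather than the root cause.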
