But you see the zookeeper session timeout events in RS logs, and the master
says that zk session for the RS's has expired, right?


On Thu, May 9, 2013 at 9:25 PM, lars hofhansl <la...@apache.org> wrote:

> Still looking. Stack and Himanshu are looking too (thanks again!).
>
> What I do know is that it has to do with the fencing mechanism during log
> splitting.
> Until I bounced HDFS and ZK (ZK probably being the culprit) each started
> RegionServer would immediately be fenced off (its log directory renamed).
> Probably by the SSH.
>
> It is not clear what caused the first RS to die. While there is no direct
> evidence, from the logs it looks like the log directory was just suddenly
> renamed.
>
> I'll spend more time in the logs and also watch for this happening again.
>
> We did find another misconfigured cluster that had some services pointed
> at this cluster. It does not look like that was actually a problem - there
> is no evidence in the logs that this actually caused a problem, but it made
> this deploy somewhat "special".
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Enis Söztutar <enis....@gmail.com>
> To: "dev@hbase.apache.org" <dev@hbase.apache.org>; lars hofhansl <
> la...@apache.org>
> Sent: Thursday, May 9, 2013 6:10 PM
> Subject: Re: All region server died due to "Parent directory doesn't exist"
>
>
>
> Were we able to find the root cause?
>
>
>
> On Thu, May 9, 2013 at 11:28 AM, lars hofhansl <la...@apache.org> wrote:
>
> Good news is that as far as I can tell no data was lost.
> >Eventually all logs were split and replayed.
> >
> >
> >
> >-- Lars
> >
> >
> >
> >----- Original Message -----
> >
> >From: lars hofhansl <la...@apache.org>
> >To: HBase Dev List <dev@hbase.apache.org>
> >
> >Cc:
> >Sent: Thursday, May 9, 2013 11:13 AM
> >Subject: Re: All region server died due to "Parent directory doesn't
> exist"
> >
> >Thanks Stack.
> >
> >I sent the logs.
> >Also, I have since bounced HDFS and ZK and the problem is gone now (I can
> start RSs again and they stay up). Something got into a weird state.
> >
> >
> >-- Lars
> >
> >
> >
> >________________________________
> >From: Stack <st...@duboce.net>
> >To: HBase Dev List <dev@hbase.apache.org>; lars hofhansl <
> la...@apache.org>
> >Sent: Thursday, May 9, 2013 10:34 AM
> >Subject: Re: All region server died due to "Parent directory doesn't
> exist"
> >
> >
> >
> >Want to send me a regionserver log Lars? (off-list)
> >St.Ack
> >
> >
> >
> >On Thu, May 9, 2013 at 10:03 AM, lars hofhansl <la...@apache.org> wrote:
> >
> >Thanks Ted and Varun.
> >>
> >>
> >>Let me check on the .META. server.
> >>
> >>
> >>The majority (13) of the RSs died within 2 minutes. The remaining 3 died
> over the following 10 minutes.
> >>So that would point to a general issue. I did not see any ZK issues but
> I'll double check.
> >>
> >>
> >>It is just interesting that even now, if I start an RS it aborts within
> a minute or two, because of this issue.
> >>
> >>
> >>-- Lars
> >>
> >>
> >>----- Original Message -----
> >>From: Ted Yu <yuzhih...@gmail.com>
> >>To: dev@hbase.apache.org
> >>
> >>Cc:
> >>Sent: Thursday, May 9, 2013 9:51 AM
> >>Subject: Re: All region server died due to "Parent directory doesn't
> exist"
> >>
> >>Thanks Varun for sharing your experience.
> >>
> >>Lars:
> >>Was the server carrying .META. functioning properly around the time when
> >>you observed the problem ?
> >>
> >>Cheers
> >>
> >>On Thu, May 9, 2013 at 9:41 AM, Varun Sharma <va...@pinterest.com>
> wrote:
> >>
> >>> I meant no NTP/clock synchronization b/w zookeeper quorum and the HBase
> >>> cluster. I am not sure if you are seeing the exact same issue though.
> We
> >>> did not have mass failures at the same time due to this..
> >>>
> >>> Thanks
> >>> Varun
> >>>
> >>>
> >>> On Thu, May 9, 2013 at 9:39 AM, Varun Sharma <va...@pinterest.com>
> wrote:
> >>>
> >>> > Btw, I am not 100 % sure but I have seen something like this
> before:
> >>> >
> >>> > 1) ZK connection flakiness causes ephemeral nodes to expire
> >>> > 2) Master detects failure and renames the logs into a splitting
> directory
> >>> > - this is intentional so that in case that region server comes back
> up,
> >>> it
> >>> > cannot write to the logs being split
> >>> > 3) Region server dies because the log is renamed
> >>> >
> >>> > So, the yanking away of files is done by the HBase master and is
> expected
> >>> > if the master feels the server is dead. We found that the Region
> server
> >>> > logs DFS exceptions like crazy (1000s of them) in that case and we
> always
> >>> > suspected that this is some kind of DFS error but when we really go
> up to
> >>> > the point where it started, we found some zookeeper session issues.
> >>> >
> >>> > We had two cases of this - either super high load or NTP/no clock
> >>> > synchronization b/w the clusters causing this issue for us.
> >>> >
> >>> > Thanks
> >>> > Varun
> >>> >
> >>> >
> >>> > On Thu, May 9, 2013 at 9:16 AM, lars hofhansl <la...@apache.org>
> wrote:
> >>> >
> >>> >> Thanks Ted. I'll do the same.
> >>> >>
> >>> >>
> >>> >> ----- Original Message -----
> >>> >> From: Ted Yu <yuzhih...@gmail.com>
> >>> >> To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
> >>> >> Cc:
> >>> >> Sent: Thursday, May 9, 2013 9:07 AM
> >>> >> Subject: Re: All region server died due to "Parent directory doesn't
> >>> >> exist"
> >>> >>
> >>> >> I went through the patch for HBASE-7824 one more time and didn't
> find
> >>> >> direct correlation to the issue Lars reported.
> >>> >>
> >>> >> I am going over the other JIRAs in Lars' list.
> >>> >>
> >>> >> Cheers
> >>> >>
> >>> >> On Thu, May 9, 2013 at 8:48 AM, lars hofhansl <la...@apache.org>
> wrote:
> >>> >>
> >>> >> > I will try. I do not think this is the issue, though.
> >>> >> >
> >>> >> > The master is up in my case.
> >>> >> > Right now the cluster is in a state where each region server
> aborts
> >>> >> itself
> >>> >> > shortly after being started (which coincides with having its log
> >>> >> directory
> >>> >> > renamed to ...-splitting).
> >>> >> >
> >>> >> >
> >>> >> > This is a test cluster and I could just start from scratch... This
> >>> >> appears
> >>> >> > to be a serious enough problem, though, and I would like to track
> down
> >>> >> the
> >>> >> > issue.
> >>> >> >
> >>> >> > -- Lars
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >> > ----- Original Message -----
> >>> >> > From: Ted Yu <yuzhih...@gmail.com>
> >>> >> > To: "dev@hbase.apache.org" <dev@hbase.apache.org>
> >>> >> > Cc: "dev@hbase.apache.org" <dev@hbase.apache.org>
> >>> >> > Sent: Thursday, May 9, 2013 2:04 AM
> >>> >> > Subject: Re: All region server died due to "Parent directory
> doesn't
> >>> >> exist"
> >>> >> >
> >>> >> > The config came from hbase-7824.
> >>> >> >
> >>> >> > There are other JIRAs in Lars' list which are related to log
> >>> splitting.
> >>> >> >
> >>> >> > I think more investigation is needed.
> >>> >> >
> >>> >> > Cheers
> >>> >> >
> >>> >> > On May 9, 2013, at 1:59 AM, Andrew Purtell <apurt...@apache.org>
> >>> wrote:
> >>> >> >
> >>> >> > > So that is HBASE-7824, right?
> >>> >> > >
> >>> >> > > On Thu, May 9, 2013 at 4:33 PM, Ted Yu <yuzhih...@gmail.com>
> wrote:
> >>> >> > >
> >>> >> > >> hbase.master.wait.for.log.splitting
> >>> >> > >
> >>> >> > >
> >>> >> > >
> >>> >> > >
> >>> >> > > --
> >>> >> > > Best regards,
> >>> >> > >
> >>> >> > >   - Andy
> >>> >> > >
> >>> >> > > Problems worthy of attack prove their worth by hitting back. -
> Piet
> >>> >> Hein
> >>> >> > > (via Tom White)
> >>> >> >
> >>> >> >
> >>> >>
> >>> >>
> >>> >
> >>>
> >>
> >>
> >
>
