But you do see ZooKeeper session timeout events in the RS logs, and the master says that the ZK sessions for the RSs have expired, right?
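
In case it helps with the check, here is a rough sketch for pulling session timeout and expiry lines out of the RS and master logs. The two patterns are my assumption about how such messages are worded (a loose, case-insensitive match on "session ... timed out" and "session ... expired"); the exact text depends on the HBase and ZooKeeper versions, so adjust them to what your logs actually contain. The function and variable names are made up for illustration.

```python
import re
import sys

# Assumed patterns: loose, case-insensitive matches for session-timeout and
# session-expiry messages. Adjust to the wording your versions actually log.
SESSION_PATTERNS = [
    re.compile(r"session.*timed out", re.IGNORECASE),
    re.compile(r"session.*expired", re.IGNORECASE),
]


def find_session_events(lines, patterns=SESSION_PATTERNS):
    """Return (line_number, text) for each line matching any pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(p.search(line) for p in patterns):
            hits.append((number, line.rstrip("\n")))
    return hits


def main(paths):
    # Print every matching line, prefixed with its file and line number.
    for path in paths:
        with open(path, errors="replace") as handle:
            for number, text in find_session_events(handle):
                print(f"{path}:{number}: {text}")


if __name__ == "__main__":
    main(sys.argv[1:])
```

Run it with the RS and master log files as arguments and compare the timestamps of the first hits against the time the log directories were renamed. If the session events come first, that would fit the expiry-then-fencing sequence Varun described.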

On Thu, May 9, 2013 at 9:25 PM, lars hofhansl <la...@apache.org> wrote:

> Still looking. Stack and Himanshu are looking too (thanks again!).
>
> What I do know is that it has to do with the fencing mechanism during log splitting.
> Until I bounced HDFS and ZK (ZK probably being the culprit) each started RegionServer would immediately be fenced off (its log directory renamed). Probably by the SSH.
>
> It is not clear what caused the first RS to die. While there is no direct evidence, from the logs it looks like the log directory was just suddenly renamed.
>
> I'll spend more time in the logs and also watch for this happening again.
>
> We did find another misconfigured cluster that had some services pointed at this cluster. It does not look like that was actually a problem - there is no evidence in the logs that this actually caused a problem, but it made this deploy somewhat "special".
>
> -- Lars
>
> ________________________________
> From: Enis Söztutar <enis....@gmail.com>
> To: "dev@hbase.apache.org" <dev@hbase.apache.org>; lars hofhansl <la...@apache.org>
> Sent: Thursday, May 9, 2013 6:10 PM
> Subject: Re: All region server died due to "Parent directory doesn't exist"
>
> Were we able to find the root cause?
>
> On Thu, May 9, 2013 at 11:28 AM, lars hofhansl <la...@apache.org> wrote:
>
> >Good news is that as far as I can tell no data was lost.
> >Eventually all logs were split and replayed.
> >
> >-- Lars
> >
> >----- Original Message -----
> >From: lars hofhansl <la...@apache.org>
> >To: HBase Dev List <dev@hbase.apache.org>
> >Cc:
> >Sent: Thursday, May 9, 2013 11:13 AM
> >Subject: Re: All region server died due to "Parent directory doesn't exist"
> >
> >Thanks Stack.
> >
> >I sent the logs.
> >Also, I have since bounced HDFS and ZK and the problem is gone now (I can start RSs again and they stay up). Something got into a weird state.
> >
> >-- Lars
> >
> >________________________________
> >From: Stack <st...@duboce.net>
> >To: HBase Dev List <dev@hbase.apache.org>; lars hofhansl <la...@apache.org>
> >Sent: Thursday, May 9, 2013 10:34 AM
> >Subject: Re: All region server died due to "Parent directory doesn't exist"
> >
> >Want to send me a regionserver log Lars? (off-list)
> >St.Ack
> >
> >On Thu, May 9, 2013 at 10:03 AM, lars hofhansl <la...@apache.org> wrote:
> >
> >Thanks Ted and Varun.
> >>
> >>Let me check on the .META. server.
> >>
> >>The majority (13) of the RSs died within 2 minutes. The remaining 3 died over the following 10 minutes.
> >>So that would point to a general issue. I did not see any ZK issues but I'll double check.
> >>
> >>It is just interesting that even now, if I start an RS it aborts within a minute or two, because of this issue.
> >>
> >>-- Lars
> >>
> >>----- Original Message -----
> >>From: Ted Yu <yuzhih...@gmail.com>
> >>To: dev@hbase.apache.org
> >>Cc:
> >>Sent: Thursday, May 9, 2013 9:51 AM
> >>Subject: Re: All region server died due to "Parent directory doesn't exist"
> >>
> >>Thanks Varun for sharing your experience.
> >>
> >>Lars:
> >>Was the server carrying .META. functioning properly around the time when you observed the problem?
> >>
> >>Cheers
> >>
> >>On Thu, May 9, 2013 at 9:41 AM, Varun Sharma <va...@pinterest.com> wrote:
> >>
> >>> I meant no NTP/clock synchronization b/w zookeeper quorum and the HBase cluster. I am not sure if you are seeing the exact same issue though. We did not have mass failures at the same time due to this.
> >>>
> >>> Thanks
> >>> Varun
> >>>
> >>> On Thu, May 9, 2013 at 9:39 AM, Varun Sharma <va...@pinterest.com> wrote:
> >>>
> >>> > Btw, I am not 100% sure but I have seen something like this before:
> >>> >
> >>> > 1) ZK connection flakiness causes ephemeral nodes to expire
> >>> > 2) Master detects failure and renames the logs into a splitting directory - this is intentional so that in case that region server comes back up, it cannot write to the logs being split
> >>> > 3) Region server dies because the log is renamed
> >>> >
> >>> > So, the yanking away of files is done by the HBase master and is expected if the master feels the server is dead. We found that the Region server logs DFS exceptions like crazy (1000s of them) in that case and we always suspected that this is some kind of DFS error but when we really went up to the point where it started, we found some zookeeper session issues.
> >>> >
> >>> > We had two cases of this - either super high load or NTP/no clock synchronization b/w the clusters causing this issue for us.
> >>> >
> >>> > Thanks
> >>> > Varun
> >>> >
> >>> > On Thu, May 9, 2013 at 9:16 AM, lars hofhansl <la...@apache.org> wrote:
> >>> >
> >>> >> Thanks Ted. I'll do the same.
> >>> >>
> >>> >> ----- Original Message -----
> >>> >> From: Ted Yu <yuzhih...@gmail.com>
> >>> >> To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
> >>> >> Cc:
> >>> >> Sent: Thursday, May 9, 2013 9:07 AM
> >>> >> Subject: Re: All region server died due to "Parent directory doesn't exist"
> >>> >>
> >>> >> I went through the patch for HBASE-7824 one more time and didn't find direct correlation to the issue Lars reported.
> >>> >>
> >>> >> I am going over the other JIRAs in Lars' list.
> >>> >>
> >>> >> Cheers
> >>> >>
> >>> >> On Thu, May 9, 2013 at 8:48 AM, lars hofhansl <la...@apache.org> wrote:
> >>> >>
> >>> >> > I will try.
> >>> >> > I do not think this is the issue, though.
> >>> >> >
> >>> >> > The master is up in my case.
> >>> >> > Right now the cluster is in a state where each region server aborts itself shortly after being started (which coincides with having its log directory renamed to ...-splitting).
> >>> >> >
> >>> >> > This is a test cluster and I could just start from scratch... This appears to be a serious enough problem, though, and I would like to track down the issue.
> >>> >> >
> >>> >> > -- Lars
> >>> >> >
> >>> >> > ----- Original Message -----
> >>> >> > From: Ted Yu <yuzhih...@gmail.com>
> >>> >> > To: "dev@hbase.apache.org" <dev@hbase.apache.org>
> >>> >> > Cc: "dev@hbase.apache.org" <dev@hbase.apache.org>
> >>> >> > Sent: Thursday, May 9, 2013 2:04 AM
> >>> >> > Subject: Re: All region server died due to "Parent directory doesn't exist"
> >>> >> >
> >>> >> > The config came from HBASE-7824.
> >>> >> >
> >>> >> > There are other JIRAs in Lars' list which are related to log splitting.
> >>> >> >
> >>> >> > I think more investigation is needed.
> >>> >> >
> >>> >> > Cheers
> >>> >> >
> >>> >> > On May 9, 2013, at 1:59 AM, Andrew Purtell <apurt...@apache.org> wrote:
> >>> >> >
> >>> >> > > So that is HBASE-7824, right?
> >>> >> > >
> >>> >> > > On Thu, May 9, 2013 at 4:33 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> >>> >> > >
> >>> >> > >> hbase.master.wait.for.log.splitting
> >>> >> > >
> >>> >> > > --
> >>> >> > > Best regards,
> >>> >> > >
> >>> >> > > - Andy
> >>> >> > >
> >>> >> > > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> >>> >> > > (via Tom White)