Want to send me a regionserver log, Lars? (off-list)
St.Ack

On Thu, May 9, 2013 at 10:03 AM, lars hofhansl <la...@apache.org> wrote:

> Thanks Ted and Varun.
>
> Let me check on the .META. server.
>
> The majority (13) of the RSs died within 2 minutes. The remaining 3 died
> over the following 10 minutes.
> So that would point to a general issue. I did not see any ZK issues but
> I'll double check.
>
> It is just interesting that even now, if I start an RS it aborts within a
> minute or two, because of this issue.
>
> -- Lars
>
> ----- Original Message -----
> From: Ted Yu <yuzhih...@gmail.com>
> To: dev@hbase.apache.org
> Cc:
> Sent: Thursday, May 9, 2013 9:51 AM
> Subject: Re: All region server died due to "Parent directory doesn't exist"
>
> Thanks Varun for sharing your experience.
>
> Lars:
> Was the server carrying .META. functioning properly around the time when
> you observed the problem?
>
> Cheers
>
> On Thu, May 9, 2013 at 9:41 AM, Varun Sharma <va...@pinterest.com> wrote:
>
> > I meant no NTP/clock synchronization b/w zookeeper quorum and the HBase
> > cluster. I am not sure if you are seeing the exact same issue though. We
> > did not have mass failures at the same time due to this.
> >
> > Thanks
> > Varun
> >
> > On Thu, May 9, 2013 at 9:39 AM, Varun Sharma <va...@pinterest.com> wrote:
> >
> > > Btw, I am not 100% sure but I have seen something like this before:
> > >
> > > 1) ZK connection flakiness causes ephemeral nodes to expire
> > > 2) Master detects failure and renames the logs into a splitting
> > > directory - this is intentional so that in case that region server
> > > comes back up, it cannot write to the logs being split
> > > 3) Region server dies because the log is renamed
> > >
> > > So, the yanking away of files is done by the HBase master and is
> > > expected if the master feels the server is dead.
> > > We found that the Region server logs DFS exceptions like crazy
> > > (1000s of them) in that case and we always suspected that this is
> > > some kind of DFS error, but when we really go up to the point where
> > > it started, we found some zookeeper session issues.
> > >
> > > We had two cases of this - either super high load or NTP/no clock
> > > synchronization b/w the clusters causing this issue for us.
> > >
> > > Thanks
> > > Varun
> > >
> > > On Thu, May 9, 2013 at 9:16 AM, lars hofhansl <la...@apache.org> wrote:
> > >
> > >> Thanks Ted. I'll do the same.
> > >>
> > >> ----- Original Message -----
> > >> From: Ted Yu <yuzhih...@gmail.com>
> > >> To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
> > >> Cc:
> > >> Sent: Thursday, May 9, 2013 9:07 AM
> > >> Subject: Re: All region server died due to "Parent directory doesn't
> > >> exist"
> > >>
> > >> I went through the patch for HBASE-7824 one more time and didn't find
> > >> a direct correlation to the issue Lars reported.
> > >>
> > >> I am going over the other JIRAs in Lars' list.
> > >>
> > >> Cheers
> > >>
> > >> On Thu, May 9, 2013 at 8:48 AM, lars hofhansl <la...@apache.org> wrote:
> > >>
> > >> > I will try. I do not think this is the issue, though.
> > >> >
> > >> > The master is up in my case.
> > >> > Right now the cluster is in a state where each region server aborts
> > >> > itself shortly after being started (which coincides with having its
> > >> > log directory renamed to ...-splitting).
> > >> >
> > >> > This is a test cluster and I could just start from scratch... This
> > >> > appears to be a serious enough problem, though, and I would like to
> > >> > track down the issue.
> > >> >
> > >> > -- Lars
> > >> >
> > >> > ----- Original Message -----
> > >> > From: Ted Yu <yuzhih...@gmail.com>
> > >> > To: "dev@hbase.apache.org" <dev@hbase.apache.org>
> > >> > Cc: "dev@hbase.apache.org" <dev@hbase.apache.org>
> > >> > Sent: Thursday, May 9, 2013 2:04 AM
> > >> > Subject: Re: All region server died due to "Parent directory doesn't
> > >> > exist"
> > >> >
> > >> > The config came from hbase-7824.
> > >> >
> > >> > There are other JIRAs in Lars' list which are related to log
> > >> > splitting.
> > >> >
> > >> > I think more investigation is needed.
> > >> >
> > >> > Cheers
> > >> >
> > >> > On May 9, 2013, at 1:59 AM, Andrew Purtell <apurt...@apache.org> wrote:
> > >> >
> > >> > > So that is HBASE-7824, right?
> > >> > >
> > >> > > On Thu, May 9, 2013 at 4:33 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > >> > >
> > >> > >> hbase.master.wait.for.log.splitting
> > >> > >
> > >> > > --
> > >> > > Best regards,
> > >> > >
> > >> > > - Andy
> > >> > >
> > >> > > Problems worthy of attack prove their worth by hitting back. - Piet
> > >> > > Hein (via Tom White)
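
For anyone following along, the sequence Varun describes in the quoted thread (the master renames a region server's log directory to a -splitting name, and the still-running server then fails on its next write) can be mimicked on a local filesystem. This is only a rough sketch, not HBase or HDFS code: the directory name "rs1-logs", the file names, and the function simulate_log_rename are made up for illustration, and a local temp directory stands in for HDFS.

```python
import os
import tempfile


def simulate_log_rename():
    """Local-filesystem stand-in for the rename-then-write sequence.

    All names here are hypothetical; nothing below calls HBase or HDFS.
    """
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical per-server log directory.
        live_dir = os.path.join(root, "rs1-logs")
        os.mkdir(live_dir)

        # The "region server" writes its first log file normally.
        with open(os.path.join(live_dir, "log.0"), "w") as f:
            f.write("edit-1\n")

        # The "master" believes the server is dead and renames the
        # directory so the old server can no longer write into it.
        splitting_dir = live_dir + "-splitting"
        os.rename(live_dir, splitting_dir)

        # The "region server" is in fact still running and tries to
        # create its next log file under the old path.
        try:
            with open(os.path.join(live_dir, "log.1"), "w") as f:
                f.write("edit-2\n")
            outcome = "write succeeded"
        except FileNotFoundError:
            # The parent directory is gone from the server's point of view.
            outcome = "abort: parent directory doesn't exist"

        return outcome, sorted(os.listdir(splitting_dir))


if __name__ == "__main__":
    print(simulate_log_rename())
```

The existing log file survives under the renamed directory, while the writer only sees a missing parent. That matches the symptom in the subject line, but it says nothing about why the master considered the servers dead in the first place.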