Himanshu,
I'm also wondering why there is no log from the DFSClient or
DFSOutputStream at first. But the code in waitForAckedSeqno looks like it
will loop forever until an ack arrives or the thread is interrupted...
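(To show what I mean, here is a rough sketch of the wait loop I believe
waitForAckedSeqno is doing, pieced together from the jstack; the names and
details are approximate, not the exact CDH 4.3.0 source:)
{code}
// Simplified sketch of the DFSOutputStream.waitForAckedSeqno() pattern as I
// understand it; a fragment for illustration, not the real implementation.
private void waitForAckedSeqno(long seqno) throws IOException {
  synchronized (dataQueue) {      // the java.util.LinkedList shown as locked in the jstack
    while (!closed && lastAckedSeqno < seqno) {
      try {
        dataQueue.wait(1000);     // woken when a packet is acked by the pipeline
      } catch (InterruptedException ie) {
        throw new InterruptedIOException(
            "Interrupted while waiting for data to be acknowledged");
      }
    }
  }
  // If no ack ever arrives and the stream is never marked closed or errored,
  // this loop never exits, which matches the TIMED_WAITING threads below.
}
{code}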
The other set of handlers is just like the replication handlers. Please just
ignore them.

Elliott Clark,
My HDFS is the version shipped in CDH 4.3.0. The client and server versions
are the same; there is no version mismatch in the cluster.

I was testing NameNode HA, so I killed the processes violently. In the
test I killed the active namenode, datanode, and regionserver on the same
node. The namenode failed over successfully, but then I found HBase
blocked in the logSyncer.

I will try to collect the HDFS logs from that moment to see if there is
anything useful.
I didn't take jstacks of the NN and DN, so there are no jstacks this time.





2013/9/4 Elliott Clark <[email protected]>

> I have seen this before.  The last guess was that it's a bug somewhere
> in the HDFS client (one of my colleagues was looking into it at the
> time).  It's missed an ack'd seq number and probably will never
> recover without a dn and rs restart.  I'll try and dig up any of the
> pertinent info.
>
> What version of HDFS are you running ?
> Same version of client and server ?
> Anything happening with the network at the time ?
> Was there a nn fail over at the time ?
>
>
>
> On Tue, Sep 3, 2013 at 7:49 PM, Himanshu Vashishtha <[email protected]>
> wrote:
> > Looking at the jstack, both the log roller and the log syncer are blocked
> > waiting for the sequence number to be acked:
> > {code}
> > "regionserver60020.logRoller" daemon prio=10 tid=0x00007f317007f800
> > nid=0x27ee6 in Object.wait() [0x00007f318acd8000]
> >    java.lang.Thread.State: TIMED_WAITING (on object monitor)
> >         at java.lang.Object.wait(Native Method)
> >         at
> >
> org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:1708)
> >         - locked <0x00007f34ae7b3638> (a java.util.LinkedList)
> >         at
> >
> org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1609)
> >
> > .....
> > "regionserver60020.logSyncer" daemon prio=10 tid=0x00007f317007e800
> > nid=0x27ee5 in Object.wait() [0x00007f318add9000]
> >    java.lang.Thread.State: TIMED_WAITING (on object monitor)
> >         at java.lang.Object.wait(Native Method)
> >         at
> >
> org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:1708)
> >         - locked <0x00007f34ae7b3638> (a java.util.LinkedList)
> >         at
> >
> org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1609)
> >
> > {code}
> >
> >
> > This blocks other append ops.
> >
> > What do you see in the logs of the NN and the DN that holds this log file?
> > Can you pastebin the NN and DN logs along with their jstacks?
> >
> > On another note, I don't see the above exception in the log you attached.
> > Is that really the meta regionserver log? All I could see for the meta
> > table is that it is calling MetaEditor to update meta, like an ordinary
> > client would. You seem to have your own set of handlers?
> > blah... "COP IPC Server handler 87 on 60020:" blah....
> >
> >
> > Thanks,
> > Himanshu
> >
> >
> > On Mon, Sep 2, 2013 at 8:30 PM, Mickey <[email protected]> wrote:
> >
> >> Hi Himanshu,
> >> It lasted for more than one hour. In the end I tried to stop the region
> >> server, and that failed too. From the jstack it was still blocked by
> >> the HLog syncer. So I killed the process with "kill -9" and then HBase
> >> recovered.
> >>
> >> hbase.regionserver.logroll.errors.tolerated is at its default value, 0.
> >>
> >> My HBase cluster is mainly based on 0.94.1.
> >>
> >> Attached are the log of the region server which hosts .META. and the
> >> jstack taken while it was blocked.
> >>
> >> Thanks,
> >> Mickey
> >>
> >>
> >>
> >> 2013/9/2 Himanshu Vashishtha <[email protected]>
> >>
> >>> Hey Mickey,
> >>>
> >>> I have a few follow-up questions:
> >>>
> >>> For how long were these threads blocked? What happened afterwards: did
> >>> the regionserver resume, or abort?
> >>> And, could you pastebin the logs after the above exception?
> >>> A sync failure causes a log roll, which is retried based on the value of
> >>> hbase.regionserver.logroll.errors.tolerated.
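> >>> (A hypothetical hbase-site.xml fragment, just to show what raising the
> >>> tolerance would look like; the value 2 below is only an illustration,
> >>> not a recommendation:)
> >>> {code}
> >>> <property>
> >>>   <name>hbase.regionserver.logroll.errors.tolerated</name>
> >>>   <value>2</value>
> >>> </property>
> >>> {code}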
> >>> Which 0.94 version are you using?
> >>>
> >>> Thanks,
> >>> Himanshu
> >>>
> >>>
> >>>
> >>> On Mon, Sep 2, 2013 at 5:16 AM, Mickey <[email protected]> wrote:
> >>>
> >>> > Hi, all
> >>> >
> >>> > I was testing HBase with HDFS QJM HA recently. The Hadoop version is
> >>> > CDH 4.3.0 and HBase is based on 0.94 with some patches (including
> >>> > HBASE-8211).
> >>> > In a test, I hit a blocking issue in HBase. I killed a node that was
> >>> > running the active namenode, as well as a datanode and a regionserver.
> >>> >
> >>> > The HDFS failover succeeded. The master tried to re-assign the regions
> >>> > after detecting that the regionserver was down, but no region could
> >>> > come online.
> >>> >
> >>> > From the log I found that all operations to .META. failed. Printing
> >>> > the jstack of the region server that hosts .META., I found the info
> >>> > below:
> >>> > "regionserver60020.logSyncer" daemon prio=10 tid=0x00007f317007e800
> >>> > nid=0x27ee5 in Object.wait() [0x00007f318add9000]
> >>> >    java.lang.Thread.State: TIMED_WAITING (on object monitor)
> >>> >         at java.lang.Object.wait(Native Method)
> >>> >         at
> >>> >
> >>> >
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:1708)
> >>> >         - locked <0x00007f34ae7b3638> (a java.util.LinkedList)
> >>> >         at
> >>> >
> >>> >
> >>>
> org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1609)
> >>> >         at
> >>> >
> org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1525)
> >>> >         at
> >>> >
> org.apache.hadoop.hdfs.DFSOutputStream.sync(DFSOutputStream.java:1510)
> >>> >         at
> >>> >
> >>>
> org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:116)
> >>> >         at
> >>> >
> org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:1208)
> >>> >         at sun.reflect.GeneratedMethodAccessor26.invoke(Unknown
> Source)
> >>> >         at
> >>> >
> >>> >
> >>>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>> >         at java.lang.reflect.Method.invoke(Method.java:597)
> >>> >         at
> >>> >
> >>> >
> >>>
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:303)
> >>> >         at
> >>> > org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1290)
> >>> >         at
> >>> > org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1247)
> >>> >         at
> >>> > org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1400)
> >>> >         at
> >>> >
> >>>
> org.apache.hadoop.hbase.regionserver.wal.HLog$LogSyncer.run(HLog.java:1199)
> >>> >         at java.lang.Thread.run(Thread.java:662)
> >>> >
> >>> > The logSyncer is always waiting in waitForAckedSeqno. All the HLog
> >>> > operations seem blocked. Is this a bug? Or did I miss some important
> >>> > patches?
> >>> >
> >>> > Hope to get your suggestions soon.
> >>> >
> >>> > Best regards,
> >>> > Mickey
> >>> >
> >>>
> >>
> >>
>
