I have seen this before. The last guess was that it's a bug somewhere in the HDFS client (one of my colleagues was looking into it at the time): it has missed an ack'd seq number and will probably never recover without a DN and RS restart. I'll try to dig up any pertinent info.
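To make the "missed ack" theory a bit more concrete, here is a very rough sketch of the wait pattern we suspect is biting us. The names and structure are simplified for illustration only; this is not the actual DFSOutputStream/ResponseProcessor code.

{code}
// Hypothetical simplification of the acked-seqno wait pattern.
public class AckWaiter {
  private final Object ackLock = new Object();
  private long lastAckedSeqno = -1;

  // Called when an ack for a packet comes back from the datanode pipeline.
  public void ackReceived(long seqno) {
    synchronized (ackLock) {
      if (seqno > lastAckedSeqno) {
        lastAckedSeqno = seqno;
      }
      ackLock.notifyAll();
    }
  }

  // Called from hflush()/sync(): block until the given seqno has been acked.
  public void waitForAckedSeqno(long seqno) throws InterruptedException {
    synchronized (ackLock) {
      while (lastAckedSeqno < seqno) {
        // Wakes up periodically (hence TIMED_WAITING in the jstack) but
        // re-checks and keeps waiting. If the ack for this seqno was lost
        // and lastAckedSeqno never advances, this loop never exits.
        ackLock.wait(1000);
      }
    }
  }
}
{code}

If that is what happened, the syncer thread sits in this loop forever, which would match the stacks quoted below.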
What version of HDFS are you running? Same version of client and server? Was anything happening with the network at the time? Was there a NN failover at the time?

On Tue, Sep 3, 2013 at 7:49 PM, Himanshu Vashishtha <[email protected]> wrote:

> Looking at the jstack, the log roller and log syncer are both blocked waiting on the sequence number:
> {code}
> "regionserver60020.logRoller" daemon prio=10 tid=0x00007f317007f800 nid=0x27ee6 in Object.wait() [0x00007f318acd8000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:1708)
>         - locked <0x00007f34ae7b3638> (a java.util.LinkedList)
>         at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1609)
> .....
> "regionserver60020.logSyncer" daemon prio=10 tid=0x00007f317007e800 nid=0x27ee5 in Object.wait() [0x00007f318add9000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:1708)
>         - locked <0x00007f34ae7b3638> (a java.util.LinkedList)
>         at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1609)
> {code}
>
> This blocks other append ops.
>
> What do you see in the NN and DN logs for the node which has this log file? Can you pastebin the NN and DN logs along with their jstacks?
>
> On another note, I don't see the above exception in the log you attached. Is that really the meta regionserver log? All I could see for the meta table is that it is calling MetaEditor to update meta, like an ordinary client. You seem to have your own set of handlers?
> blah... "COP IPC Server handler 87 on 60020:" blah....
>
> Thanks,
> Himanshu
>
> On Mon, Sep 2, 2013 at 8:30 PM, Mickey <[email protected]> wrote:
>
>> Hi Himanshu,
>> It lasted for more than one hour. In the end I tried to stop the region server and failed; from the jstack it was still blocked by the HLog syncer. So I killed the process with "kill -9" and then HBase recovered.
>>
>> hbase.regionserver.logroll.errors.tolerated is at the default value, 0.
>>
>> My HBase cluster is mainly based on 0.94.1.
>>
>> Attached are the log of the region server which contains .META. and the jstack taken while it was blocked.
>>
>> Thanks,
>> Mickey
>>
>> 2013/9/2 Himanshu Vashishtha <[email protected]>
>>
>>> Hey Mickey,
>>>
>>> I have a few follow-up questions:
>>>
>>> For how long were these threads blocked? What happens afterwards: does the regionserver resume, or abort?
>>> And, could you pastebin the logs after the above exception?
>>> A sync failure causes a log roll, which is retried based on the value of hbase.regionserver.logroll.errors.tolerated.
>>> Which 0.94 version are you using?
>>>
>>> Thanks,
>>> Himanshu
>>>
>>> On Mon, Sep 2, 2013 at 5:16 AM, Mickey <[email protected]> wrote:
>>>
>>> > Hi, all
>>> >
>>> > I was testing HBase with HDFS QJM HA recently. The Hadoop version is CDH 4.3.0 and HBase is based on 0.94 with some patches (including HBASE-8211).
>>> > In a test, I hit a blocking issue in HBase. I killed a node which was the active namenode and also hosted a datanode and regionserver.
>>> >
>>> > HDFS failed over successfully. The master tried to re-assign the regions after detecting that the regionserver was down, but no region could come online.
>>> >
>>> > From the logs I found that all operations to .META. failed. Printing the jstack of the region server which contains .META., I found the info below:
>>> >
>>> > "regionserver60020.logSyncer" daemon prio=10 tid=0x00007f317007e800 nid=0x27ee5 in Object.wait() [0x00007f318add9000]
>>> >    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>>> >         at java.lang.Object.wait(Native Method)
>>> >         at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:1708)
>>> >         - locked <0x00007f34ae7b3638> (a java.util.LinkedList)
>>> >         at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1609)
>>> >         at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1525)
>>> >         at org.apache.hadoop.hdfs.DFSOutputStream.sync(DFSOutputStream.java:1510)
>>> >         at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:116)
>>> >         at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:1208)
>>> >         at sun.reflect.GeneratedMethodAccessor26.invoke(Unknown Source)
>>> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>> >         at java.lang.reflect.Method.invoke(Method.java:597)
>>> >         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:303)
>>> >         at org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1290)
>>> >         at org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1247)
>>> >         at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1400)
>>> >         at org.apache.hadoop.hbase.regionserver.wal.HLog$LogSyncer.run(HLog.java:1199)
>>> >         at java.lang.Thread.run(Thread.java:662)
>>> >
>>> > The logSyncer is always waiting in waitForAckedSeqno, and all the HLog operations seem blocked. Is this a bug? Or did I miss some important patches?
>>> >
>>> > Hope to get your suggestions soon.
>>> >
>>> > Best regards,
>>> > Mickey
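One more note on hbase.regionserver.logroll.errors.tolerated, since it came up above: that setting only helps when the sync or roll actually throws. Very roughly, the roller behaves like the sketch below (a hypothetical simplification for discussion, not the real LogRoller code). In this hang the sync never fails; it just blocks inside waitForAckedSeqno, so a counter like this would never trip and the regionserver never aborts on its own.

{code}
// Hypothetical simplification of a "tolerated consecutive roll errors" counter.
public class RollErrorTracker {
  private final int tolerated;      // e.g. hbase.regionserver.logroll.errors.tolerated (default 0)
  private int consecutiveErrors = 0;

  public RollErrorTracker(int tolerated) {
    this.tolerated = tolerated;
  }

  // Called after each attempted log roll.
  public void rollAttempted(boolean failed) {
    if (!failed) {
      consecutiveErrors = 0;         // a successful roll resets the counter
      return;
    }
    consecutiveErrors++;
    if (consecutiveErrors > tolerated) {
      // The real regionserver would request an abort at this point
      // rather than keep limping along with a broken WAL.
      throw new IllegalStateException(
          "Too many consecutive WAL roll failures: " + consecutiveErrors);
    }
  }
}
{code}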
