[ https://issues.apache.org/jira/browse/HBASE-14401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745221#comment-14745221 ]

ramkrishna.s.vasudevan commented on HBASE-14401:
------------------------------------------------

I got this in the latest trunk code:
{code}
r exception  for block BP-134581926-10.224.54.69-1440773710983:blk_1073748067_7278
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2280)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:734)
2015-09-15 21:35:29,637 INFO  [regionserver/stobdtserver2/10.224.54.69:16041-shortCompactions-1442333074343] compactions.PressureAwareCompactionThroughputController: test1,,1442333101783.ae8d456f5cf641df7e3ef0e5bb8ffcc9.#info#1 average throughput is 5.16 MB/sec, slept 28 time(s) and total slept time is 51877 ms. 1 active compactions remaining, total limit is 12.86 MB/sec
2015-09-15 21:35:29,712 WARN  [regionserver/stobdtserver2/10.224.54.69:16041.append-pool3-t1] wal.FSHLog: Append sequenceId=503, requesting roll of WAL
java.io.IOException: All datanodes DatanodeInfoWithStorage[10.224.54.69:18216,DS-e882ae26-a4bb-497e-9bd3-8ee4f35cfe7f,DISK] are bad. Aborting...
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1084)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:876)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:402)
2015-09-15 21:35:29,734 ERROR [regionserver/stobdtserver2/10.224.54.69:16041-shortCompactions-1442333074343] regionserver.CompactSplitThread: Compaction failed Request = regionName=test1,,1442333101783.ae8d456f5cf641df7e3ef0e5bb8ffcc9., storeName=info, fileCount=3, fileSize=343.1 M (114.3 M, 114.4 M, 114.5 M), priority=7, time=14621388953502371
org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: On sync
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1792)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1670)
        at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:128)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: Append sequenceId=503, requesting roll of WAL
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.append(FSHLog.java:1893)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1748)
        ... 5 more
Caused by: java.io.IOException: All datanodes DatanodeInfoWithStorage[10.224.54.69:18216,DS-e882ae26-a4bb-497e-9bd3-8ee4f35cfe7f,DISK] are bad. Aborting...
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1084)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:876)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:402)
{code}
Digging in more, I did find an issue with the DNs. After this I had:
{code}
java.io.IOException: cannot get log writer
        at org.apache.hadoop.hbase.wal.DefaultWALProvider.createWriter(DefaultWALProvider.java:346)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.createWriterInstance(FSHLog.java:708)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:673)
        at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:144)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: Parent directory doesn't exist: /hbase3/WALs/stobdtserver2,16041,1442333060894
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:2236)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2367)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2315)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2266)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:542)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:369)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
{code}
Note that I had replication enabled, but I doubt that could be the cause here. 
Will check more.
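
For reference, a quick client-side check like the sketch below would confirm whether the WAL directory really went missing on the NameNode side. This is just an illustration, not part of any patch: the path is copied from the FileNotFoundException above, and it assumes the cluster's core-site.xml/hdfs-site.xml are on the classpath.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckWalDir {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS etc. from the Hadoop config files on the classpath.
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      // Path copied verbatim from the "Parent directory doesn't exist" trace.
      Path walDir = new Path("/hbase3/WALs/stobdtserver2,16041,1442333060894");
      System.out.println(walDir + " exists? " + fs.exists(walDir));
    }
  }
}
{code}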

> Stamp failed appends with sequenceid too.... Cleans up latches
> --------------------------------------------------------------
>
>                 Key: HBASE-14401
>                 URL: https://issues.apache.org/jira/browse/HBASE-14401
>             Project: HBase
>          Issue Type: Sub-task
>          Components: test, wal
>            Reporter: stack
>            Assignee: stack
>             Fix For: 2.0.0, 1.2.0, 1.3.0
>
>         Attachments: 14401.txt, 14401.v7.txt, 14401.v7.txt, 14401.v7.txt, 
> 14401v3.txt, 14401v3.txt, 14401v3.txt, 14401v6.txt
>
>
> Looking in test output I see we can sometimes get stuck waiting on a 
> sequenceid... The parent issue's redo of our semantics makes it so we 
> encounter failed appends more often around a damaged WAL.
> This patch makes it so we stamp the sequenceid always, even if the append 
> fails. This way all sequenceids are accounted for but, more importantly, the 
> latch on the sequenceid down in WALKey will be cleared, where before it was 
> not being cleared (there is no global list of outstanding WALKeys waiting on 
> sequenceids, so there is no way to clean them up... we don't need such a list 
> if we ALWAYS stamp the sequenceid).
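
Re the description above, here is a minimal, hypothetical sketch of the idea; WalKeyStub and AppendSketch are made-up names standing in for WALKey and the FSHLog append path, not the real code. It just shows why stamping the sequenceid even on a failed append clears the latch that waiters block on.
{code}
import java.util.concurrent.CountDownLatch;

// Stand-in for WALKey: readers block on a latch until a sequenceid is stamped.
class WalKeyStub {
  private final CountDownLatch seqIdAssigned = new CountDownLatch(1);
  private volatile long sequenceId = -1;

  void stampSequenceId(long seqId) {
    this.sequenceId = seqId;
    seqIdAssigned.countDown(); // releases anyone blocked in getSequenceId()
  }

  long getSequenceId() throws InterruptedException {
    seqIdAssigned.await(); // would hang forever if the key were never stamped
    return this.sequenceId;
  }
}

class AppendSketch {
  // Stamp the key in a finally block so even a failed append (e.g. one that
  // throws DamagedWALException) still accounts for its sequenceid and frees
  // any thread waiting on the latch.
  void append(WalKeyStub key, long nextSeqId, Runnable doAppend) {
    try {
      doAppend.run();
    } finally {
      key.stampSequenceId(nextSeqId);
    }
  }
}
{code}
Per the description, the actual change does the equivalent inside FSHLog's ring buffer handler, which is why no global list of outstanding WALKeys is needed.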


