[
https://issues.apache.org/jira/browse/HBASE-14401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745221#comment-14745221
]
ramkrishna.s.vasudevan commented on HBASE-14401:
------------------------------------------------
I got this with the latest trunk code:
{code}
r exception for block BP-134581926-10.224.54.69-1440773710983:blk_1073748067_7278
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2280)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:734)
2015-09-15 21:35:29,637 INFO [regionserver/stobdtserver2/10.224.54.69:16041-shortCompactions-1442333074343] compactions.PressureAwareCompactionThroughputController: test1,,1442333101783.ae8d456f5cf641df7e3ef0e5bb8ffcc9.#info#1 average throughput is 5.16 MB/sec, slept 28 time(s) and total slept time is 51877 ms. 1 active compactions remaining, total limit is 12.86 MB/sec
2015-09-15 21:35:29,712 WARN [regionserver/stobdtserver2/10.224.54.69:16041.append-pool3-t1] wal.FSHLog: Append sequenceId=503, requesting roll of WAL
java.io.IOException: All datanodes DatanodeInfoWithStorage[10.224.54.69:18216,DS-e882ae26-a4bb-497e-9bd3-8ee4f35cfe7f,DISK] are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1084)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:876)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:402)
2015-09-15 21:35:29,734 ERROR [regionserver/stobdtserver2/10.224.54.69:16041-shortCompactions-1442333074343] regionserver.CompactSplitThread: Compaction failed Request = regionName=test1,,1442333101783.ae8d456f5cf641df7e3ef0e5bb8ffcc9., storeName=info, fileCount=3, fileSize=343.1 M (114.3 M, 114.4 M, 114.5 M), priority=7, time=14621388953502371
org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: On sync
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1792)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1670)
    at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:128)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: Append sequenceId=503, requesting roll of WAL
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.append(FSHLog.java:1893)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1748)
    ... 5 more
Caused by: java.io.IOException: All datanodes DatanodeInfoWithStorage[10.224.54.69:18216,DS-e882ae26-a4bb-497e-9bd3-8ee4f35cfe7f,DISK] are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1084)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:876)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:402)
{code}
Digging in more, I did find an issue with the DNs. After this I had:
{code}
java.io.IOException: cannot get log writer
    at org.apache.hadoop.hbase.wal.DefaultWALProvider.createWriter(DefaultWALProvider.java:346)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.createWriterInstance(FSHLog.java:708)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:673)
    at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:144)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: Parent directory doesn't exist: /hbase3/WALs/stobdtserver2,16041,1442333060894
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:2236)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2367)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2315)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2266)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:542)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:369)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
{code}
Note that I had replication enabled, but I doubt that could cause this. Will check more.
> Stamp failed appends with sequenceid too.... Cleans up latches
> --------------------------------------------------------------
>
> Key: HBASE-14401
> URL: https://issues.apache.org/jira/browse/HBASE-14401
> Project: HBase
> Issue Type: Sub-task
> Components: test, wal
> Reporter: stack
> Assignee: stack
> Fix For: 2.0.0, 1.2.0, 1.3.0
>
> Attachments: 14401.txt, 14401.v7.txt, 14401.v7.txt, 14401.v7.txt,
> 14401v3.txt, 14401v3.txt, 14401v3.txt, 14401v6.txt
>
>
> Looking in test output I see we can sometimes get stuck waiting on a
> sequenceid... The parent issue's redo of our semantics makes it so we encounter
> failed appends more often around a damaged WAL.
> This patch makes it so we stamp the sequenceid always, even if the append fails.
> This way all sequenceids are accounted for but, more importantly, the latch on
> the sequenceid down in WALKey will be cleared, where before it was not being
> cleared (there is no global list of outstanding WALKeys waiting on
> sequenceids, so there is no way to clean them up... we don't need such a list if
> we ALWAYS stamp the sequenceid). A sketch of this idea follows below.
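To make the description above concrete, here is a minimal, self-contained sketch. This is not the actual HBASE-14401 patch; the Key, stampSequenceId and writeToWal names are hypothetical simplifications of WALKey and the FSHLog append path. The point it illustrates is that the sequenceid is stamped even when the underlying write fails, so anything blocked on the key's latch is released instead of hanging.
{code}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

public class StampOnFailureSketch {

  /** Hypothetical stand-in for WALKey: readers block until a sequenceid is stamped. */
  static class Key {
    private final CountDownLatch latch = new CountDownLatch(1);
    private volatile long sequenceId = -1;

    void stampSequenceId(long id) {
      this.sequenceId = id;
      latch.countDown();          // releases anyone waiting on the sequenceid
    }

    long getSequenceId() throws InterruptedException {
      latch.await();              // would hang forever if the id were never stamped
      return sequenceId;
    }
  }

  private final AtomicLong nextSeq = new AtomicLong();

  /** Hypothetical append handler: stamp the key whether or not the write succeeds. */
  void append(Key key, byte[] edit) throws Exception {
    long id = nextSeq.incrementAndGet();
    try {
      writeToWal(id, edit);       // may throw, e.g. "All datanodes ... are bad"
    } catch (Exception e) {
      // Failed append: stamp anyway so the latch in the key is cleared and the
      // failure surfaces as an exception rather than as a stuck waiter.
      key.stampSequenceId(id);
      throw new Exception("append failed, requesting roll of WAL", e);
    }
    key.stampSequenceId(id);
  }

  private void writeToWal(long id, byte[] edit) throws Exception {
    // placeholder for the real write out to HDFS
  }
}
{code}
With this, a waiter calling getSequenceId() always returns once the append has been processed, successful or not, which matches the "cleans up latches" part of the summary.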