[ https://issues.apache.org/jira/browse/HBASE-17381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15801468#comment-15801468 ]

huzheng edited comment on HBASE-17381 at 1/5/17 2:22 PM:
---------------------------------------------------------

[~ghelmling] I uploaded a patch that aborts the region server if an OOME occurs. For the other exception cases I rethrow them (the ReplicationSourceWorkerThread exits and the region server keeps running), because it seems hard to be sure whether a given case is recoverable or not.
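
For reference, a minimal sketch of the behaviour described above (this is not the attached HBASE-17381.patch itself): abort on OutOfMemoryError, rethrow anything else so only the worker thread dies while the region server keeps running. The nested Abortable interface and the Worker/replicateEdits names below are illustrative stand-ins, not the actual HBase types touched by the patch.

{code:java}
// Illustrative sketch only, not the real patch: OOME aborts the server,
// other runtime exceptions are rethrown so the worker thread exits and is
// reported by the thread's UncaughtExceptionHandler.
public class ReplicationWorkerSketch {

  /** Hypothetical stand-in for the region server's Abortable. */
  interface Abortable {
    void abort(String why, Throwable cause);
  }

  static class Worker extends Thread {
    private final Abortable server;

    Worker(Abortable server) {
      this.server = server;
    }

    @Override
    public void run() {
      try {
        replicateEdits();                       // normal WAL shipping loop
      } catch (OutOfMemoryError oome) {
        // Unrecoverable: take the whole region server down rather than
        // letting replication silently stop.
        server.abort("OOME in replication source worker", oome);
      } catch (RuntimeException e) {
        // Hard to tell whether this is recoverable, so surface it: the
        // worker thread dies but the region server keeps running.
        throw e;
      }
    }

    private void replicateEdits() {
      // Placeholder for reading WAL entries and shipping them to the peer.
    }
  }

  public static void main(String[] args) throws InterruptedException {
    Abortable fakeServer = (why, cause) -> System.err.println("abort: " + why);
    Worker w = new Worker(fakeServer);
    w.start();
    w.join();
  }
}
{code}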



> ReplicationSourceWorkerThread can die due to unhandled exceptions
> -----------------------------------------------------------------
>
>                 Key: HBASE-17381
>                 URL: https://issues.apache.org/jira/browse/HBASE-17381
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Gary Helmling
>         Attachments: HBASE-17381.patch
>
>
> If a ReplicationSourceWorkerThread encounters an unexpected exception in the 
> run() method (for example failure to allocate direct memory for the DFS 
> client), the exception will be logged by the UncaughtExceptionHandler, but 
> the thread will also die and the replication queue will back up indefinitely 
> until the regionserver is restarted.
> We should make sure the worker thread is resilient to all exceptions that it 
> can actually handle.  For those that it really can't, it seems better to 
> abort the regionserver rather than just allow replication to stop with 
> minimal signal.
> Here is a sample exception:
> {noformat}
> ERROR regionserver.ReplicationSource: Unexpected exception in ReplicationSourceWorkerThread, currentPath=hdfs://.../hbase/WALs/XXXwalfilenameXXX
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:693)
> at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> at org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:96)
> at org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:113)
> at org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:108)
> at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.createStreamPair(DataTransferSaslUtil.java:344)
> at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:490)
> at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:391)
> at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
> at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
> at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
> at org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:92)
> at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
> at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:778)
> at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:695)
> at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:356)
> at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:673)
> at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:308)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
> at org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
> at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
> at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {noformat}



