[
https://issues.apache.org/jira/browse/HBASE-17381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830926#comment-15830926
]
Gary Helmling commented on HBASE-17381:
---------------------------------------
[~openinx] Thanks for the patch. This is an improvement.
Do you think it ever makes sense to let the ReplicationSourceWorkerThread die
and not abort the regionserver? In this case, replication will simply stop
with little visibility into it happening.
It seems better to me to just always abort in the case of an unexpected
Throwable thrown from runLoop(). Anything recoverable should be handled in a
try/catch in runLoop itself.
> ReplicationSourceWorkerThread can die due to unhandled exceptions
> -----------------------------------------------------------------
>
> Key: HBASE-17381
> URL: https://issues.apache.org/jira/browse/HBASE-17381
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Reporter: Gary Helmling
> Assignee: huzheng
> Attachments: HBASE-17381.patch
>
>
> If a ReplicationSourceWorkerThread encounters an unexpected exception in the
> run() method (for example failure to allocate direct memory for the DFS
> client), the exception will be logged by the UncaughtExceptionHandler, but
> the thread will also die and the replication queue will back up indefinitely
> until the Regionserver is restarted.
> We should make sure the worker thread is resilient to all exceptions that it
> can actually handle. For those that it really can't, it seems better to
> abort the regionserver rather than just allow replication to stop with
> minimal signal.
> Here is a sample exception:
> {noformat}
> ERROR regionserver.ReplicationSource: Unexpected exception in
> ReplicationSourceWorkerThread,
> currentPath=hdfs://.../hbase/WALs/XXXwalfilenameXXX
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:693)
> at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> at
> org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:96)
> at
> org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:113)
> at
> org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:108)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.createStreamPair(DataTransferSaslUtil.java:344)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:490)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:391)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
> at
> org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:92)
> at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
> at
> org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:778)
> at
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:695)
> at
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:356)
> at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:673)
> at
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:308)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
> at
> org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
> at
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
> at
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)