[
https://issues.apache.org/jira/browse/HBASE-8919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709216#comment-13709216
]
Jean-Daniel Cryans commented on HBASE-8919:
-------------------------------------------
Finally got another test that failed with the stack trace; here it is (from
http://54.241.6.143/job/HBase-0.95/org.apache.hbase$hbase-server/610/testReport/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/):
{noformat}
2013-07-12 21:17:00,382 INFO [Thread-962] regionserver.ReplicationSource$2(799): Slave cluster looks down: Call to ip-10-196-81-100.us-west-1.compute.internal/10.196.81.100:39599 failed on local exception: java.nio.channels.ClosedByInterruptException
java.io.IOException: Call to ip-10-196-81-100.us-west-1.compute.internal/10.196.81.100:39599 failed on local exception: java.nio.channels.ClosedByInterruptException
        at org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClient.java:1401)
        at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1373)
        at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1573)
        at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1630)
        at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:15213)
        at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1466)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$2.run(ReplicationSource.java:793)
Caused by: java.nio.channels.ClosedByInterruptException
        at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:343)
        at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
        at java.io.DataOutputStream.flush(DataOutputStream.java:106)
        at org.apache.hadoop.hbase.ipc.IPCUtil.write(IPCUtil.java:231)
        at org.apache.hadoop.hbase.ipc.IPCUtil.write(IPCUtil.java:220)
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection.writeRequest(RpcClient.java:1014)
        at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1349)
{noformat}
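For context on the {{ClosedByInterruptException}} in the trace above: NIO's interruptible-channel machinery closes the channel and throws this exception whenever the thread doing channel I/O is (or becomes) interrupted, which is why the write in {{RpcClient$Connection.writeRequest}} fails. A minimal, standalone sketch of that JDK behavior (plain JDK classes only, no HBase code; the class and method names here are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.Pipe;

public class InterruptDemo {
    // Returns true if the channel write failed with ClosedByInterruptException.
    static boolean writeWithPendingInterrupt() throws Exception {
        Pipe pipe = Pipe.open();
        // Set this thread's interrupt status before the channel operation.
        Thread.currentThread().interrupt();
        try {
            // An interruptible channel sees the pending interrupt, closes
            // itself, and throws ClosedByInterruptException from end().
            pipe.sink().write(ByteBuffer.wrap(new byte[] {1}));
            return false;
        } catch (ClosedByInterruptException expected) {
            return true;
        } finally {
            Thread.interrupted(); // clear the interrupt status we set
            pipe.source().close();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("interrupted write threw: " + writeWithPendingInterrupt());
    }
}
```

So whichever code interrupted the replication source thread is what ultimately killed the RPC connection; the exception itself is just the messenger.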
> TestReplicationQueueFailover (and Compressed) can fail because the recovered
> queue gets stuck on ClosedByInterruptException
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-8919
> URL: https://issues.apache.org/jira/browse/HBASE-8919
> Project: HBase
> Issue Type: Bug
> Reporter: Jean-Daniel Cryans
> Assignee: Jean-Daniel Cryans
> Attachments: HBASE-8919.patch
>
>
> Looking at this build:
> https://builds.apache.org/job/hbase-0.95-on-hadoop2/173/testReport/org.apache.hadoop.hbase.replication/TestReplicationQueueFailoverCompressed/queueFailover/
> The only thing I can find that went wrong is that the recovered queue was not
> completely done because the source fails like this:
> {noformat}
> 2013-07-10 11:53:51,538 INFO [Thread-1259] regionserver.ReplicationSource$2(799): Slave cluster looks down: Call to hemera.apache.org/140.211.11.27:38614 failed on local exception: java.nio.channels.ClosedByInterruptException
> {noformat}
> And just before that it got:
> {noformat}
> 2013-07-10 11:53:51,290 WARN [ReplicationExecutor-0.replicationSource,2-hemera.apache.org,43669,1373457208379] regionserver.ReplicationSource(661): Can't replicate because of an error on the remote cluster:
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException): org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1594 actions: FailedServerException: 1594 times,
>         at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:158)
>         at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$500(AsyncProcess.java:146)
>         at org.apache.hadoop.hbase.client.AsyncProcess.getErrors(AsyncProcess.java:692)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:2106)
>         at org.apache.hadoop.hbase.client.HTable.batchCallback(HTable.java:689)
>         at org.apache.hadoop.hbase.client.HTable.batchCallback(HTable.java:697)
>         at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:682)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.batch(ReplicationSink.java:239)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.replicateEntries(ReplicationSink.java:161)
>         at org.apache.hadoop.hbase.replication.regionserver.Replication.replicateLogEntries(Replication.java:173)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.replicateWALEntry(HRegionServer.java:3735)
>         at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:14402)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2122)
>         at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1829)
>         at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1369)
>         at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1573)
>         at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1630)
>         at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.replicateWALEntry(AdminProtos.java:15177)
>         at org.apache.hadoop.hbase.protobuf.ReplicationProtbufUtil.replicateWALEntry(ReplicationProtbufUtil.java:94)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:642)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:376)
> {noformat}
> I wonder what's closing the socket with an interrupt; it seems the source still needs
> to replicate more data. I'll start by adding the stack trace to the message that's
> logged when replication fails on a "local exception". I also found a thread that
> wasn't shut down properly, which I'm going to fix to help with debugging.
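The logging improvement proposed above amounts to handing the {{Throwable}} itself to the logger instead of only its message, so the interrupt's origin shows up. A hedged sketch with plain {{java.util.logging}} (HBase itself uses a different logging facade; the class name and {{wrappedLocalException}} helper here are illustrative, not from the actual patch):

```java
import java.io.IOException;
import java.nio.channels.ClosedByInterruptException;
import java.util.logging.Level;
import java.util.logging.Logger;

public class LocalExceptionLogging {
    private static final Logger LOG = Logger.getLogger("ReplicationSourceDemo");

    // Builds an exception shaped like the one in the test log: an IOException
    // wrapping a ClosedByInterruptException as its cause.
    static IOException wrappedLocalException() {
        return new IOException(
            "Call to slave failed on local exception: java.nio.channels.ClosedByInterruptException",
            new ClosedByInterruptException());
    }

    public static void main(String[] args) {
        IOException ioe = wrappedLocalException();
        // Message-only logging: the interrupt's stack trace and cause are lost.
        LOG.info("Slave cluster looks down: " + ioe.getMessage());
        // Passing the Throwable as well preserves the full trace and cause
        // chain, which is what makes a "local exception" debuggable.
        LOG.log(Level.INFO, "Slave cluster looks down", ioe);
    }
}
```

With the second form, the log shows the {{Caused by: java.nio.channels.ClosedByInterruptException}} frames, pointing at whoever interrupted the thread.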
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira