[ https://issues.apache.org/jira/browse/HBASE-8919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709216#comment-13709216 ]

Jean-Daniel Cryans commented on HBASE-8919:
-------------------------------------------

Finally got another test run that failed with the stack trace; here it is (from http://54.241.6.143/job/HBase-0.95/org.apache.hbase$hbase-server/610/testReport/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/):

{noformat}
2013-07-12 21:17:00,382 INFO  [Thread-962] regionserver.ReplicationSource$2(799): Slave cluster looks down: Call to ip-10-196-81-100.us-west-1.compute.internal/10.196.81.100:39599 failed on local exception: java.nio.channels.ClosedByInterruptException
java.io.IOException: Call to ip-10-196-81-100.us-west-1.compute.internal/10.196.81.100:39599 failed on local exception: java.nio.channels.ClosedByInterruptException
        at org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClient.java:1401)
        at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1373)
        at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1573)
        at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1630)
        at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:15213)
        at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1466)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$2.run(ReplicationSource.java:793)
Caused by: java.nio.channels.ClosedByInterruptException
        at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:343)
        at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
        at java.io.DataOutputStream.flush(DataOutputStream.java:106)
        at org.apache.hadoop.hbase.ipc.IPCUtil.write(IPCUtil.java:231)
        at org.apache.hadoop.hbase.ipc.IPCUtil.write(IPCUtil.java:220)
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection.writeRequest(RpcClient.java:1014)
        at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1349)
{noformat}
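For reference, ClosedByInterruptException is what the JDK throws when a thread blocked on an interruptible channel operation gets interrupted: per the AbstractInterruptibleChannel contract, the channel is closed as a side effect and the thread's interrupt status stays set. A minimal standalone sketch (plain JDK NIO, not HBase code; class and method names here are made up for illustration) that reproduces the bottom of the stack trace above:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.concurrent.atomic.AtomicReference;

public class InterruptDemo {
    // Returns the exception the interrupted writer saw and whether the
    // channel survived the interrupt.
    static String demo() throws Exception {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            // A server that accepts at the kernel level but never reads, so
            // the client's writes eventually block on full socket buffers.
            server.bind(new InetSocketAddress("127.0.0.1", 0));
            SocketChannel client = SocketChannel.open(server.getLocalAddress());

            AtomicReference<String> result = new AtomicReference<>();
            Thread writer = new Thread(() -> {
                ByteBuffer buf = ByteBuffer.allocate(64 * 1024);
                try {
                    while (true) {
                        buf.clear();
                        client.write(buf); // blocks once the buffers fill
                    }
                } catch (IOException e) {
                    // The interrupt delivered ClosedByInterruptException and
                    // closed the channel underneath us.
                    result.set(e.getClass().getSimpleName() + " open=" + client.isOpen());
                }
            });
            writer.start();
            Thread.sleep(500);  // let the writer block on a full buffer
            writer.interrupt(); // AbstractInterruptibleChannel.end() fires
            writer.join();
            return result.get();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo()); // ClosedByInterruptException open=false
    }
}
```

The point being: whoever interrupts the thread doesn't just stop the current write, it kills the connection for good, which would explain why the source then reports the slave cluster as down.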
                
> TestReplicationQueueFailover (and Compressed) can fail because the recovered 
> queue gets stuck on ClosedByInterruptException
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-8919
>                 URL: https://issues.apache.org/jira/browse/HBASE-8919
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>         Attachments: HBASE-8919.patch
>
>
> Looking at this build: 
> https://builds.apache.org/job/hbase-0.95-on-hadoop2/173/testReport/org.apache.hadoop.hbase.replication/TestReplicationQueueFailoverCompressed/queueFailover/
> The only thing I can find that went wrong is that the recovered queue was not 
> completely drained, because the source failed like this:
> {noformat}
> 2013-07-10 11:53:51,538 INFO  [Thread-1259] regionserver.ReplicationSource$2(799): Slave cluster looks down: Call to hemera.apache.org/140.211.11.27:38614 failed on local exception: java.nio.channels.ClosedByInterruptException
> {noformat}
> And just before that it got:
> {noformat}
> 2013-07-10 11:53:51,290 WARN  [ReplicationExecutor-0.replicationSource,2-hemera.apache.org,43669,1373457208379] regionserver.ReplicationSource(661): Can't replicate because of an error on the remote cluster: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException): org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1594 actions: FailedServerException: 1594 times, 
>       at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:158)
>       at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$500(AsyncProcess.java:146)
>       at org.apache.hadoop.hbase.client.AsyncProcess.getErrors(AsyncProcess.java:692)
>       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:2106)
>       at org.apache.hadoop.hbase.client.HTable.batchCallback(HTable.java:689)
>       at org.apache.hadoop.hbase.client.HTable.batchCallback(HTable.java:697)
>       at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:682)
>       at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.batch(ReplicationSink.java:239)
>       at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.replicateEntries(ReplicationSink.java:161)
>       at org.apache.hadoop.hbase.replication.regionserver.Replication.replicateLogEntries(Replication.java:173)
>       at org.apache.hadoop.hbase.regionserver.HRegionServer.replicateWALEntry(HRegionServer.java:3735)
>       at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:14402)
>       at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2122)
>       at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1829)
>       at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1369)
>       at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1573)
>       at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1630)
>       at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.replicateWALEntry(AdminProtos.java:15177)
>       at org.apache.hadoop.hbase.protobuf.ReplicationProtbufUtil.replicateWALEntry(ReplicationProtbufUtil.java:94)
>       at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:642)
>       at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:376)
> {noformat}
> I wonder what's closing the socket with an interrupt; it seems the source still 
> needs to replicate more data. I'll start by adding the stack trace to the message 
> logged when it fails to replicate on a "local exception". I also found a thread 
> that wasn't shut down properly, which I'm going to fix to help with debugging.
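On the "recovered queue gets stuck" symptom in the description above: since the interrupt both closes the channel permanently and leaves the thread's interrupt status set (the JDK guarantees the status survives a ClosedByInterruptException), a retry loop that only catches IOException will keep retrying against a dead channel. A hedged sketch of the distinction (generic Java, not the actual ReplicationSource code; `Op`, `ship`, and the return strings are invented for illustration):

```java
import java.io.IOException;
import java.nio.channels.ClosedByInterruptException;

public class RetryLoop {
    interface Op { void run() throws IOException; }

    // Returns how the loop exited: "shipped" on success, "interrupted" when it
    // detects a deliberate stop, or "gave-up" after maxAttempts (a stand-in
    // for a loop that would otherwise spin forever against a closed channel).
    static String ship(Op op, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                op.run();
                return "shipped";
            } catch (ClosedByInterruptException e) {
                // The JDK sets the interrupt status before throwing this, so a
                // well-behaved worker can tell shutdown from a transient error.
                if (Thread.currentThread().isInterrupted()) {
                    return "interrupted";
                }
            } catch (IOException e) {
                // transient failure: fall through and retry
            }
        }
        return "gave-up";
    }

    public static void main(String[] args) {
        Thread.currentThread().interrupt(); // simulate a stop request mid-ship
        String outcome = ship(() -> { throw new ClosedByInterruptException(); }, 3);
        Thread.interrupted(); // clear the flag before exiting
        System.out.println(outcome);
    }
}
```

If the real loop ignores the preserved interrupt status, that would match the observed behavior of the recovered queue never finishing.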

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira