[ https://issues.apache.org/jira/browse/HBASE-16270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15404156#comment-15404156 ]

Robert Yokota commented on HBASE-16270:
---------------------------------------

I went back to check the logs for our 6-node cluster.  Unfortunately, I see the 
UnexpectedStateException in all of the logs except the most recent, but nothing 
from before the errors started, because they swamped the logs.  However, that 
was only the case on 5 of the 6 region servers.  On the 6th region server I do 
not see any UnexpectedStateException errors; instead, the following exception 
swamps the logs.  Maybe this information will help.

2016-07-21 18:30:25,765 WARN  [regionserver/stagehbase-5.az2.bl2.yammer.com/10.243.237.61:16020.replicationSource,region_replica_replication] regionserver.RegionReplicaReplicationEndpoint: Received IOException while trying to replicate
java.io.IOException: java.net.SocketTimeoutException: callTimeout=1200000, callDuration=1581759: null
        at org.apache.hadoop.hbase.wal.WALSplitter$PipelineController.checkForErrors(WALSplitter.java:876)
        at org.apache.hadoop.hbase.wal.WALSplitter$EntryBuffers.appendEntry(WALSplitter.java:939)
        at org.apache.hadoop.hbase.replication.regionserver.RegionReplicaReplicationEndpoint.replicate(RegionReplicaReplicationEndpoint.java:290)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:694)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:406)
Caused by: java.net.SocketTimeoutException: callTimeout=1200000, callDuration=1581759: null
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159)
        at org.apache.hadoop.hbase.replication.regionserver.RegionReplicaReplicationEndpoint$RetryingRpcCallable.call(RegionReplicaReplicationEndpoint.java:606)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: Can't get the location
        at org.apache.hadoop.hbase.client.RegionAdminServiceCallable.getRegionLocations(RegionAdminServiceCallable.java:178)
        at org.apache.hadoop.hbase.client.RegionAdminServiceCallable.getLocation(RegionAdminServiceCallable.java:105)
        at org.apache.hadoop.hbase.client.RegionAdminServiceCallable.prepare(RegionAdminServiceCallable.java:89)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:124)
        ... 5 more
Caused by: java.io.IOException: HRegionInfo was null in feedie:counters,
  row=keyvalues={feedie:counters,*,1449083378490.b5edddf50e5a18f6ff013b10a4b82edd./info:regioninfo/1449083379635/Put/vlen=43/seqid=0,
  feedie:counters,*,1449083378490.b5edddf50e5a18f6ff013b10a4b82edd./info:seqnumDuringOpen/1469114520082/Put/vlen=8/seqid=0,
  feedie:counters,*,1449083378490.b5edddf50e5a18f6ff013b10a4b82edd./info:server/1469114520082/Put/vlen=37/seqid=0,
  feedie:counters,*,1449083378490.b5edddf50e5a18f6ff013b10a4b82edd./info:serverstartcode/1469114520082/Put/vlen=8/seqid=0}
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1263)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1155)
        at org.apache.hadoop.hbase.client.RegionAdminServiceCallable.getRegionLocations(RegionAdminServiceCallable.java:170)
        ... 8 more



> When using region replicas, get thousands of occurrences of 
> UnexpectedStateException: Current snapshot id is -1
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-16270
>                 URL: https://issues.apache.org/jira/browse/HBASE-16270
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.1.2
>            Reporter: Robert Yokota
>         Attachments: HBASE-16270-branch-1.2.patch, HBASE-16270-master.patch
>
>
> We have an HBase (1.1.2) production cluster with 58 region servers and a 
> staging cluster with 6 region servers.
> For both clusters, we enabled region replicas with the following settings:
> hbase.regionserver.storefile.refresh.period = 0
> hbase.region.replica.replication.enabled = true
> hbase.region.replica.replication.memstore.enabled = true
> hbase.master.hfilecleaner.ttl = 3600000
> hbase.master.loadbalancer.class = org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer
> hbase.meta.replica.count = 3
> hbase.regionserver.meta.storefile.refresh.period = 30000
> hbase.region.replica.wait.for.primary.flush = true
> hbase.region.replica.storefile.refresh.memstore.multiplier = 4
> We then altered our HBase tables to have REGION_REPLICATION => 2 (a sketch of 
> applying these settings programmatically follows the quoted stack trace 
> below).
> Both clusters got into a state where a few region servers were spewing the 
> error below in the HBase logs.  In one instance the error occurred over 70K 
> times.  While this was happening, these region servers would see 10x their 
> usual write traffic (the hadoop.HBase.RegionServer.Server.writeRequestCount 
> metric) and in some instances would crash.
> At the same time, we would see secondary regions move and then go into the 
> "reads are disabled" state for extended periods.
> It appears there is a race condition where the DefaultMemStore::clearSnapshot 
> method can be called more than once in succession.  The first call sets 
> snapshotId to -1, but the second call then throws an exception.  It seems the 
> second call should simply return if snapshotId is already -1, rather than 
> throwing an exception (a sketch of this guard also follows the quoted stack 
> trace below).
> Thu Jul 21 08:38:50 UTC 2016, RpcRetryingCaller{globalStartTime=1469090201543, pause=100, retries=35}, org.apache.hadoop.hbase.regionserver.UnexpectedStateException: org.apache.hadoop.hbase.regionserver.UnexpectedStateException: Current snapshot id is -1,passed 1469085004304
>         at org.apache.hadoop.hbase.regionserver.DefaultMemStore.clearSnapshot(DefaultMemStore.java:187)
>         at org.apache.hadoop.hbase.regionserver.HStore.updateStorefiles(HStore.java:1054)
>         at org.apache.hadoop.hbase.regionserver.HStore.access$500(HStore.java:128)
>         at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.replayFlush(HStore.java:2270)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayFlushInStores(HRegion.java:4487)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayWALFlushCommitMarker(HRegion.java:4388)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayWALFlushMarker(HRegion.java:4191)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.doReplayBatchOp(RSRpcServices.java:776)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.replay(RSRpcServices.java:1655)
>         at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22255)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>         at java.lang.Thread.run(Thread.java:745)
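
For anyone trying to reproduce the setup quoted above, here is a minimal sketch 
of applying the replica-related settings and the REGION_REPLICATION => 2 
alteration programmatically against an HBase 1.1.x client.  The property names 
and the feedie:counters table come from this issue; everything else (the class 
name, the disable/enable around modifyTable) is illustrative rather than our 
exact provisioning code, and the server-side properties normally belong in 
hbase-site.xml rather than client code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class EnableRegionReplicas {
  public static void main(String[] args) throws Exception {
    // Replica-related settings from the issue description; on a real cluster
    // these are set in hbase-site.xml on the servers, not in client code.
    Configuration conf = HBaseConfiguration.create();
    conf.setInt("hbase.regionserver.storefile.refresh.period", 0);
    conf.setBoolean("hbase.region.replica.replication.enabled", true);
    conf.setBoolean("hbase.region.replica.replication.memstore.enabled", true);
    conf.setBoolean("hbase.region.replica.wait.for.primary.flush", true);

    try (Connection connection = ConnectionFactory.createConnection(conf);
         Admin admin = connection.getAdmin()) {
      // Raise REGION_REPLICATION to 2 on an existing table; "feedie:counters"
      // is the table that shows up in the log above.
      TableName table = TableName.valueOf("feedie:counters");
      HTableDescriptor desc = admin.getTableDescriptor(table);
      desc.setRegionReplication(2);
      admin.disableTable(table);       // the table needs to be offline to
      admin.modifyTable(table, desc);  // change REGION_REPLICATION
      admin.enableTable(table);
    }
  }
}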
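
And here is a standalone sketch of the early-return guard suggested in the 
description: if snapshotId is already -1, a duplicate clearSnapshot call 
becomes a no-op instead of throwing.  The names mirror branch-1's 
DefaultMemStore, but this is a toy model for illustration, not the attached 
patches, and it throws a plain IOException where the real code throws 
UnexpectedStateException.

import java.io.IOException;

public class ClearSnapshotGuardSketch {
  private static final long NO_SNAPSHOT_ID = -1;
  private volatile long snapshotId = NO_SNAPSHOT_ID;

  // Start a snapshot with the given id (stands in for DefaultMemStore.snapshot()).
  public void snapshot(long id) {
    this.snapshotId = id;
  }

  // Clear the snapshot. The first check is the suggested fix: if the snapshot
  // was already cleared (e.g. by a replayed flush-commit marker), return
  // quietly instead of throwing.
  public void clearSnapshot(long id) throws IOException {
    if (this.snapshotId == NO_SNAPSHOT_ID) {
      return;
    }
    if (this.snapshotId != id) {
      throw new IOException("Current snapshot id is " + this.snapshotId + ",passed " + id);
    }
    // ... the real method swaps in an empty snapshot and releases the old one here ...
    this.snapshotId = NO_SNAPSHOT_ID;
  }

  public static void main(String[] args) throws IOException {
    ClearSnapshotGuardSketch memstore = new ClearSnapshotGuardSketch();
    memstore.snapshot(1469085004304L);
    memstore.clearSnapshot(1469085004304L);  // first call clears the snapshot
    memstore.clearSnapshot(1469085004304L);  // duplicate call is now a no-op
  }
}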



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
