[
https://issues.apache.org/jira/browse/HBASE-16270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15404156#comment-15404156
]
Robert Yokota commented on HBASE-16270:
---------------------------------------
I went back to check the logs for our 6-node cluster. Unfortunately, the
UnexpectedStateException appears in all of the logs except the most recent,
and I can't see what happened before it started because these errors swamped
the logs. That was only the case for 5 of the 6 region servers, though: on the
6th region server I don't see any UnexpectedStateException errors, but the
following exception swamps the logs instead. Maybe this information will help.
2016-07-21 18:30:25,765 WARN [regionserver/stagehbase-5.az2.bl2.yammer.com/10.243.237.61:16020.replicationSource,region_replica_replication] regionserver.RegionReplicaReplicationEndpoint: Received IOException while trying to replicate
java.io.IOException: java.net.SocketTimeoutException: callTimeout=1200000, callDuration=1581759: null
        at org.apache.hadoop.hbase.wal.WALSplitter$PipelineController.checkForErrors(WALSplitter.java:876)
        at org.apache.hadoop.hbase.wal.WALSplitter$EntryBuffers.appendEntry(WALSplitter.java:939)
        at org.apache.hadoop.hbase.replication.regionserver.RegionReplicaReplicationEndpoint.replicate(RegionReplicaReplicationEndpoint.java:290)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:694)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:406)
Caused by: java.net.SocketTimeoutException: callTimeout=1200000, callDuration=1581759: null
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159)
        at org.apache.hadoop.hbase.replication.regionserver.RegionReplicaReplicationEndpoint$RetryingRpcCallable.call(RegionReplicaReplicationEndpoint.java:606)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: Can't get the location
        at org.apache.hadoop.hbase.client.RegionAdminServiceCallable.getRegionLocations(RegionAdminServiceCallable.java:178)
        at org.apache.hadoop.hbase.client.RegionAdminServiceCallable.getLocation(RegionAdminServiceCallable.java:105)
        at org.apache.hadoop.hbase.client.RegionAdminServiceCallable.prepare(RegionAdminServiceCallable.java:89)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:124)
        ... 5 more
Caused by: java.io.IOException: HRegionInfo was null in feedie:counters, row=keyvalues={feedie:counters,*,1449083378490.b5edddf50e5a18f6ff013b10a4b82edd./info:regioninfo/1449083379635/Put/vlen=43/seqid=0,
        feedie:counters,*,1449083378490.b5edddf50e5a18f6ff013b10a4b82edd./info:seqnumDuringOpen/1469114520082/Put/vlen=8/seqid=0,
        feedie:counters,*,1449083378490.b5edddf50e5a18f6ff013b10a4b82edd./info:server/1469114520082/Put/vlen=37/seqid=0,
        feedie:counters,*,1449083378490.b5edddf50e5a18f6ff013b10a4b82edd./info:serverstartcode/1469114520082/Put/vlen=8/seqid=0}
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1263)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1155)
        at org.apache.hadoop.hbase.client.RegionAdminServiceCallable.getRegionLocations(RegionAdminServiceCallable.java:170)
        ... 8 more
> When using region replicas, get thousands of occurrences of UnexpectedStateException: Current snapshot id is -1
> ---------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-16270
> URL: https://issues.apache.org/jira/browse/HBASE-16270
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 1.1.2
> Reporter: Robert Yokota
> Attachments: HBASE-16270-branch-1.2.patch, HBASE-16270-master.patch
>
>
> We have an HBase (1.1.2) production cluster with 58 region servers and a
> staging cluster with 6 region servers.
> For both clusters, we enabled region replicas with the following settings:
> hbase.regionserver.storefile.refresh.period = 0
> hbase.region.replica.replication.enabled = true
> hbase.region.replica.replication.memstore.enabled = true
> hbase.master.hfilecleaner.ttl = 3600000
> hbase.master.loadbalancer.class = org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer
> hbase.meta.replica.count = 3
> hbase.regionserver.meta.storefile.refresh.period = 30000
> hbase.region.replica.wait.for.primary.flush = true
> hbase.region.replica.storefile.refresh.memstore.multiplier = 4
> We then altered our HBase tables to have REGION_REPLICATION => 2
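> For reference, here is a minimal sketch of applying this setup through the HBase Java client API. It is illustrative only: these properties normally live in hbase-site.xml on every node, and the table name is just an example taken from the logs above.
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.hbase.HBaseConfiguration;
>     import org.apache.hadoop.hbase.HTableDescriptor;
>     import org.apache.hadoop.hbase.TableName;
>     import org.apache.hadoop.hbase.client.Admin;
>     import org.apache.hadoop.hbase.client.Connection;
>     import org.apache.hadoop.hbase.client.ConnectionFactory;
>
>     public class EnableRegionReplicas {
>       public static void main(String[] args) throws Exception {
>         // Shown here only to document the values used above; in practice
>         // these are cluster-wide settings in hbase-site.xml.
>         Configuration conf = HBaseConfiguration.create();
>         conf.setInt("hbase.regionserver.storefile.refresh.period", 0);
>         conf.setBoolean("hbase.region.replica.replication.enabled", true);
>         conf.setBoolean("hbase.region.replica.replication.memstore.enabled", true);
>         conf.setInt("hbase.meta.replica.count", 3);
>
>         try (Connection connection = ConnectionFactory.createConnection(conf);
>              Admin admin = connection.getAdmin()) {
>           // Equivalent of the shell command: alter 'feedie:counters', {REGION_REPLICATION => 2}
>           TableName table = TableName.valueOf("feedie:counters");
>           HTableDescriptor desc = admin.getTableDescriptor(table);
>           desc.setRegionReplication(2);
>           admin.disableTable(table);
>           admin.modifyTable(table, desc);
>           admin.enableTable(table);
>         }
>       }
>     }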
> Both clusters got into a state where a few region servers were spewing the
> error below in the HBase logs; in one instance the error occurred over 70K
> times. While in this state, these region servers saw 10x write traffic
> (per the hadoop.HBase.RegionServer.Server.writeRequestCount metric) and in
> some instances crashed.
> At the same time, we would see secondary regions move and then go into the
> "reads are disabled" state for extended periods.
> It appears there is a race condition where the DefaultMemStore::clearSnapshot
> method might be called more than once in succession. The first call would set
> snapshotId to -1, but the second call would throw an exception. It seems the
> second call should just return if the snapshotId is already -1, rather than
> throwing an exception.
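> As a rough illustration of that suggestion, here is a minimal sketch of an idempotent clearSnapshot guard. It assumes only the snapshotId field visible in the stack trace below and is an outline of the idea, not the attached patch.
>
>     import org.apache.hadoop.hbase.regionserver.UnexpectedStateException;
>
>     // Stand-in for the relevant part of DefaultMemStore.
>     class MemStoreSnapshotSketch {
>       private volatile long snapshotId = -1; // -1 means no snapshot is outstanding
>
>       void clearSnapshot(long id) throws UnexpectedStateException {
>         if (snapshotId == -1) {
>           // A previous clearSnapshot already reset the id; treat a duplicate
>           // flush-commit replay as a no-op instead of failing the replay RPC.
>           return;
>         }
>         if (snapshotId != id) {
>           throw new UnexpectedStateException(
>               "Current snapshot id is " + snapshotId + ",passed " + id);
>         }
>         // ... the existing logic that swaps the snapshot back into the
>         // memstore would run here ...
>         snapshotId = -1;
>       }
>     }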
> Thu Jul 21 08:38:50 UTC 2016, RpcRetryingCaller{globalStartTime=1469090201543, pause=100, retries=35}, org.apache.hadoop.hbase.regionserver.UnexpectedStateException: org.apache.hadoop.hbase.regionserver.UnexpectedStateException: Current snapshot id is -1,passed 1469085004304
>         at org.apache.hadoop.hbase.regionserver.DefaultMemStore.clearSnapshot(DefaultMemStore.java:187)
>         at org.apache.hadoop.hbase.regionserver.HStore.updateStorefiles(HStore.java:1054)
>         at org.apache.hadoop.hbase.regionserver.HStore.access$500(HStore.java:128)
>         at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.replayFlush(HStore.java:2270)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayFlushInStores(HRegion.java:4487)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayWALFlushCommitMarker(HRegion.java:4388)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayWALFlushMarker(HRegion.java:4191)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.doReplayBatchOp(RSRpcServices.java:776)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.replay(RSRpcServices.java:1655)
>         at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22255)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>         at java.lang.Thread.run(Thread.java:745)