[
https://issues.apache.org/jira/browse/HBASE-23169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952514#comment-16952514
]
Karthick commented on HBASE-23169:
----------------------------------
[~wchevreuil] I've included the TRACE logs from when the RegionServer failed.
Note that the queue entry for
172.72.72.72%2C16020%2C1570500915276.1571196349507 is removed at 20:32:02,101,
and the abort at 20:32:02,198 is a failed position write for that same WAL.
{code:java}
2019-10-15 20:32:01,902 INFO [main-SendThread(172.17.17.17:2191)]
zookeeper.ClientCnxn: Socket connection established to
172.17.17.17/172.17.17.17:2191, initiating session
2019-10-15 20:32:01,904 INFO [main-SendThread(172.17.17.17:2191)]
zookeeper.ClientCnxn: Session establishment complete on server
172.17.17.17/172.17.17.17:2191, sessionid = 0x166d3a44fd582762, negotiated
timeout = 40000
2019-10-15 20:32:01,906 DEBUG
[RpcServer.replication.FPBQ.Fifo.handler=1,queue=0,port=16020]
regionserver.ReplicationSink: Finished replicating mutations.
2019-10-15 20:32:01,930 DEBUG
[RpcServer.replication.FPBQ.Fifo.handler=1,queue=0,port=16020]
regionserver.ReplicationSink: Started replicating mutations.
2019-10-15 20:32:01,934 DEBUG
[RpcServer.replication.FPBQ.Fifo.handler=1,queue=0,port=16020]
regionserver.ReplicationSink: Finished replicating mutations.
2019-10-15 20:32:01,966 DEBUG
[RpcServer.replication.FPBQ.Fifo.handler=1,queue=0,port=16020]
regionserver.ReplicationSink: Started replicating mutations.
2019-10-15 20:32:01,970 DEBUG
[RpcServer.replication.FPBQ.Fifo.handler=1,queue=0,port=16020]
regionserver.ReplicationSink: Finished replicating mutations.
2019-10-15 20:32:01,989 DEBUG
[RpcServer.replication.FPBQ.Fifo.handler=1,queue=0,port=16020]
regionserver.ReplicationSink: Started replicating mutations.
2019-10-15 20:32:01,993 DEBUG
[RpcServer.replication.FPBQ.Fifo.handler=1,queue=0,port=16020]
regionserver.ReplicationSink: Finished replicating mutations.
2019-10-15 20:32:02,033 DEBUG
[RpcServer.replication.FPBQ.Fifo.handler=1,queue=0,port=16020]
regionserver.ReplicationSink: Started replicating mutations.
2019-10-15 20:32:02,036 DEBUG
[RpcServer.replication.FPBQ.Fifo.handler=1,queue=0,port=16020]
regionserver.ReplicationSink: Finished replicating mutations.
2019-10-15 20:32:02,055 DEBUG
[RpcServer.replication.FPBQ.Fifo.handler=1,queue=0,port=16020]
regionserver.ReplicationSink: Started replicating mutations.
2019-10-15 20:32:02,058 DEBUG
[RpcServer.replication.FPBQ.Fifo.handler=1,queue=0,port=16020]
regionserver.ReplicationSink: Finished replicating mutations.
2019-10-15 20:32:02,063 DEBUG
[RpcServer.replication.FPBQ.Fifo.handler=1,queue=0,port=16020]
regionserver.ReplicationSink: Started replicating mutations.
2019-10-15 20:32:02,068 DEBUG
[RpcServer.replication.FPBQ.Fifo.handler=1,queue=0,port=16020]
regionserver.ReplicationSink: Finished replicating mutations.
2019-10-15 20:32:02,095 DEBUG
[regionserver//172.72.72.72:16020.replicationSource.replicationWALReaderThread.172.72.72.72%2C16020%2C1570500915276,2]
regionserver.WALEntryStream: Reached the end of WAL file
'hdfs://OtherGridMetaCluster/hbasedata/WALs/172.72.72.72,16020,1570500915276/172.72.72.72%2C16020%2C1570500915276.1571196349507'.
It was not closed cleanly, so we did not parse 8 bytes of data. This is
normally ok.
2019-10-15 20:32:02,095 TRACE
[regionserver//172.72.72.72:16020.replicationSource.replicationWALReaderThread.172.72.72.72%2C16020%2C1570500915276,2]
regionserver.WALEntryStream: Reached the end of log
hdfs://OtherGridMetaCluster/hbasedata/WALs/172.72.72.72,16020,1570500915276/172.72.72.72%2C16020%2C1570500915276.1571196349507,
and the length of the file is 127513914
2019-10-15 20:32:02,095 DEBUG
[regionserver//172.72.72.72:16020.replicationSource.replicationWALReaderThread.172.72.72.72%2C16020%2C1570500915276,2]
regionserver.WALEntryStream: Reached the end of log
hdfs://OtherGridMetaCluster/hbasedata/WALs/172.72.72.72,16020,1570500915276/172.72.72.72%2C16020%2C1570500915276.1571196349507
2019-10-15 20:32:02,101 DEBUG
[regionserver//172.72.72.72:16020.replicationSource.replicationWALReaderThread.172.72.72.72%2C16020%2C1570500915276,2]
regionserver.ReplicationSourceManager: Removing 1 logs in the list:
[172.72.72.72%2C16020%2C1570500915276.1571196349507]
2019-10-15 20:32:02,108 DEBUG
[RpcServer.replication.FPBQ.Fifo.handler=1,queue=0,port=16020]
regionserver.ReplicationSink: Started replicating mutations.
2019-10-15 20:32:02,110 DEBUG
[RpcServer.replication.FPBQ.Fifo.handler=1,queue=0,port=16020]
regionserver.ReplicationSink: Finished replicating mutations.
2019-10-15 20:32:02,130 TRACE
[regionserver//172.72.72.72:16020.replicationSource.replicationWALReaderThread.172.72.72.72%2C16020%2C1570500915276,2]
regionserver.ReplicationSourceWALReaderThread: Read 1 WAL entries eligible for
replication
2019-10-15 20:32:02,130 TRACE
[regionserver//172.72.72.72:16020.replicationSource.172.72.72.72%2C16020%2C1570500915276,2]
regionserver.HBaseInterClusterReplicationEndpoint: Submitting 1 entries of
total size 376
2019-10-15 20:32:02,131 TRACE [pool-29-thread-198]
regionserver.HBaseInterClusterReplicationEndpoint: Replicating batch 817993410
of 1 entries with total size 347 bytes to
2019-10-15 20:32:02,133 TRACE
[regionserver//172.72.72.72:16020.replicationSource.replicationWALReaderThread.172.72.72.72%2C16020%2C1570500915276,2]
regionserver.ReplicationSourceWALReaderThread: Didn't read any new entries
from WAL
2019-10-15 20:32:02,188 DEBUG
[RpcServer.replication.FPBQ.Fifo.handler=1,queue=0,port=16020]
regionserver.ReplicationSink: Started replicating mutations.
2019-10-15 20:32:02,192 DEBUG
[RpcServer.replication.FPBQ.Fifo.handler=1,queue=0,port=16020]
regionserver.ReplicationSink: Finished replicating mutations.
2019-10-15 20:32:02,192 DEBUG
[RpcServer.replication.FPBQ.Fifo.handler=0,queue=0,port=16020]
regionserver.ReplicationSink: Started replicating mutations.
2019-10-15 20:32:02,195 TRACE [pool-29-thread-198]
regionserver.HBaseInterClusterReplicationEndpoint: Completed replicating batch
817993410
2019-10-15 20:32:02,197 DEBUG
[RpcServer.replication.FPBQ.Fifo.handler=0,queue=0,port=16020]
regionserver.ReplicationSink: Finished replicating mutations.
2019-10-15 20:32:02,198 FATAL
[regionserver//172.72.72.72:16020.replicationSource.172.72.72.72%2C16020%2C1570500915276,2]
regionserver.HRegionServer: ABORTING region server
172.72.72.72,16020,1570500915276: Failed to write replication wal position
(filename=172.72.72.72%2C16020%2C1570500915276.1571196349507, position=63690)
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
for
/hbase/replication/rs/172.72.72.72,16020,1570500915276/2/172.72.72.72%2C16020%2C1570500915276.1571196349507
at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1327)
at
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:422)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:824)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:874)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:868)
at
org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.setLogPosition(ReplicationQueuesZKImpl.java:155)
at
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:194)
at
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:727)
at
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:698)
at
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:551)
2019-10-15 20:32:02,198 FATAL
[regionserver//172.72.72.72:16020.replicationSource.172.72.72.72%2C16020%2C1570500915276,2]
regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
2019-10-15 20:32:02,206 INFO
[regionserver//172.72.72.72:16020.replicationSource.172.72.72.72%2C16020%2C1570500915276,2]
regionserver.HRegionServer: tputSize_num_ops" : 1180,
"FlushOutputSize_min" : 0,
"FlushOutputSize_max" : 0,
"FlushOutputSize_mean" : 0,
"FlushOutputSize_25th_percentile" : 0,
"FlushOutputSize_median" : 0,
"FlushOutputSize_75th_percentile" : 0,
"FlushOutputSize_90th_percentile" : 0,
"FlushOutputSize_95th_percentile" : 0,
"FlushOutputSize_98th_percentile" : 0,
"FlushOutputSize_99th_percentile" : 0,
"FlushOutputSize_99.9th_percentile" : 0,
"Bulkload_count" : 0,
"Bulkload_mean_rate" : 0.0,
"Bulkload_1min_rate" : 0.0,
"Bulkload_5min_rate" : 0.0,
"Bulkload_15min_rate" : 0.0,
"Bulkload_num_ops" : 0,
"Bulkload_min" : 0,
"Bulkload_max" : 0,
"Bulkload_mean" : 0,
"Bulkload_25th_percentile" : 0,
"Bulkload_median" : 0,
"Bulkload_75th_percentile" : 0,
"Bulkload_90th_percentile" : 0,
"Bulkload_95th_percentile" : 0,
"Bulkload_98th_percentile" : 0,
"Bulkload_99th_percentile" : 0,
"Bulkload_99.9th_percentile" : 0
} ]
2019-10-15 20:32:02,231 INFO
[regionserver//172.72.72.72:16020.replicationSource.172.72.72.72%2C16020%2C1570500915276,2]
regionserver.HRegionServer: STOPPED: Failed to write replication wal position
(filename=172.72.72.72%2C16020%2C1570500915276.1571196349507, position=63690)
2019-10-15 20:32:02,231 DEBUG
[regionserver//172.72.72.72:16020.replicationSource.172.72.72.72%2C16020%2C1570500915276,2]
regionserver.ReplicationSourceManager: Removing 0 logs in the list: []
2019-10-15 20:32:02,231 INFO [regionserver//172.72.72.72:16020]
regionserver.SplitLogWorker: Sending interrupt to stop the worker thread
2019-10-15 20:32:02,238 INFO [regionserver//172.72.72.72:16020]
regionserver.HRegionServer: Stopping infoServer
2019-10-15 20:32:02,238 TRACE
[regionserver//172.72.72.72:16020.replicationSource.172.72.72.72%2C16020%2C1570500915276,2]
regionserver.ReplicationSource: Replicated 638 entries in total, or 638
operations in 65 ms
2019-10-15 20:32:02,238 INFO [SplitLogWorker-172.72.72.72:16020]
regionserver.SplitLogWorker: SplitLogWorker interrupted. Exiting.
2019-10-15 20:32:02,244 INFO [SplitLogWorker-172.72.72.72:16020]
regionserver.SplitLogWorker: SplitLogWorker 172.72.72.72,16020,1570500915276
exiting
{code}
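For context, here is a minimal sketch of the failing write as I read the stack
trace above. The class, helper name, and the standalone main() are illustrative
assumptions, not HBase's actual code; only the ZooKeeper calls, the znode path,
and the connection details come from the log.
{code:java}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class WalPositionSketch {

  // Illustrative helper: persist the WAL read offset under the per-queue
  // znode, the way setLogPosition does via ZKUtil.setData.
  static void writeWalPosition(ZooKeeper zk, String walZnode, long position)
      throws KeeperException, InterruptedException {
    // setData with version -1 throws KeeperException.NoNodeException when the
    // znode is already gone, e.g. deleted by the old-WALs queue cleanup.
    zk.setData(walZnode, String.valueOf(position).getBytes(), -1);
  }

  public static void main(String[] args) throws Exception {
    // Connection string and negotiated timeout taken from the session log above.
    ZooKeeper zk = new ZooKeeper("172.17.17.17:2191", 40000, event -> { });
    try {
      writeWalPosition(zk,
          "/hbase/replication/rs/172.72.72.72,16020,1570500915276/2/"
              + "172.72.72.72%2C16020%2C1570500915276.1571196349507",
          63690L);
    } catch (KeeperException.NoNodeException e) {
      // In 1.4.x this exception propagates out of updateLogPosition and the
      // RegionServer aborts with "Failed to write replication wal position".
      System.err.println("Znode already removed: " + e.getPath());
    } finally {
      zk.close();
    }
  }
}
{code}
This matches the interleaving in the log: the queue entry for the WAL is
removed at 20:32:02,101, so the position write at 20:32:02,198 can only fail
with NoNode.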
> Random region server aborts while clearing Old Wals
> ---------------------------------------------------
>
> Key: HBASE-23169
> URL: https://issues.apache.org/jira/browse/HBASE-23169
> Project: HBase
> Issue Type: Bug
> Components: regionserver, Replication, wal
> Affects Versions: 1.4.10, 1.4.11
> Reporter: Karthick
> Assignee: Wellington Chevreuil
> Priority: Blocker
> Labels: patch
>
> After applying the patch given in
> [HBASE-22784|https://jira.apache.org/jira/browse/HBASE-22784], random region
> server aborts were noticed. This happens in the ReplicationSourceShipperThread
> while writing the replication WAL position.
> {code:java}
> 2019-10-05 08:17:28,132 FATAL
> [regionserver//172.20.20.20:16020.replicationSource.172.20.20.20%2C16020%2C1570193969775,2]
> regionserver.HRegionServer: ABORTING region server
> 172.20.20.20,16020,1570193969775: Failed to write replication wal position
> (filename=172.20.20.20%2C16020%2C1570193969775.1570288637045,
> position=127494739)
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
> for
> /hbase/replication/rs/172.20.20.20,16020,1570193969775/2/172.20.20.20%2C16020%2C1570193969775.1570288637045
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1327)
> at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:422)
> at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:824)
> at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:874)
> at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:868)
> at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.setLogPosition(ReplicationQueuesZKImpl.java:155)
> at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:194)
> at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:727)
> at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:698)
> at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:551)
> 2019-10-05 08:17:28,133 FATAL
> [regionserver//172.20.20.20:16020.replicationSource.172.20.20.20%2C16020%2C1570193969775,2]
> regionserver.HRegionServer: RegionServer abort: loaded coprocessors are:
> [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
> {code}