Haoze Wu created HBASE-28014:
--------------------------------
Summary: Transient underlying HDFS failure causes permanent
replication failure between 2 HBase clusters
Key: HBASE-28014
URL: https://issues.apache.org/jira/browse/HBASE-28014
Project: HBase
Issue Type: Bug
Components: Replication
Affects Versions: 2.0.0
Reporter: Haoze Wu
We are doing systematic testing on HBase stable release 2.0.0 and found that a
transient underlying HDFS failure may cause flaky behavior in replication
between HBase clusters.
We use the basic setup from TestReplicationBase.java, in which the first
cluster is started with 2 slaves and the second cluster with 4 slaves. The
second cluster is then configured as the slave (peer) cluster of the first
cluster to test replication.
{code:java}
...
// start the first (source) cluster with 2 region servers
utility1.startMiniCluster(2);
// start the second (peer) cluster with 4 region servers
utility2.startMiniCluster(4);
...
// register the second cluster as replication peer "2" of the first cluster
hbaseAdmin.addReplicationPeer("2", rpc);
...{code}
After the setup is done, we run runSimplePutdeleteTest() from
TestReplicationBase.java, which adds one row to the common table in the first
cluster and then checks that the row is successfully replicated to the same
table in the second cluster. However, when a transient HDFS failure happens,
this test becomes flaky.
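For clarity, here is a minimal sketch of what that put-and-verify step looks
like (class, field, and parameter names are illustrative, not the exact
members of TestReplicationBase.java): write one row through a table handle on
the first cluster, then poll the same table on the second cluster until the
row appears or the retries run out.
{code:java}
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

final class PutAndVerifySketch {
  // Write one row on the source cluster, then poll the peer cluster's table
  // until the row is visible or the retries are exhausted.
  static void putAndVerifyReplication(Table htable1, Table htable2,
      int retries, long sleepMs) throws Exception {
    byte[] row = Bytes.toBytes("row1");
    Put put = new Put(row);
    put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v1"));
    htable1.put(put); // table handle on the first cluster

    Get get = new Get(row);
    for (int i = 0; i < retries; i++) {
      Result result = htable2.get(get); // table handle on the second cluster
      if (!result.isEmpty()) {
        return; // the row was replicated, so the test passes
      }
      Thread.sleep(sleepMs); // replication is asynchronous, so wait and retry
    }
    throw new AssertionError("Row was never replicated to the second cluster");
  }
}{code}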
In hbaseAdmin.addReplicationPeer("2", rpc) during setup, the znodes for the
two regionservers in the first cluster are updated first to reflect the
addition of cluster 2 as a replication peer:
{code:java}
2023-08-03 10:19:55,137 DEBUG [Time-limited test-EventThread]
zookeeper.ZKUtil(355): regionserver:40051-0x189bbc4c84c0005,
quorum=localhost:62113, baseZNode=/1 Set watcher on existing
znode=/1/replication/peers/2
2023-08-03 10:19:55,137 DEBUG [Time-limited test-EventThread]
zookeeper.ZKUtil(355): regionserver:40693-0x189bbc4c84c0004,
quorum=localhost:62113, baseZNode=/1 Set watcher on existing
znode=/1/replication/peers/2 {code}
After that, the ZKWatcher at each regionserver notices the change and applies
the corresponding state changes in the HBase implementation, via the following
stack trace:
{code:java}
at org.apache.hadoop.hbase.util.CommonFSUtils.getRootDir(CommonFSUtils.java:367)
at
org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.init(HBaseInterClusterReplicationEndpoint.java:160)
at
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.getReplicationSource(ReplicationSourceManager.java:509)
at
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.addSource(ReplicationSourceManager.java:263)
at
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.peerListChanged(ReplicationSourceManager.java:626)
at
org.apache.hadoop.hbase.replication.ReplicationTrackerZKImpl$PeersWatcher.nodeChildrenChanged(ReplicationTrackerZKImpl.java:191)
at org.apache.hadoop.hbase.zookeeper.ZKWatcher.process(ZKWatcher.java:481)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:505){code}
However, CommonFSUtils#getRootDir() calls getFileSystem(), which steps into
the HDFS interface, where an IOException can be thrown because of underlying
HDFS failures.
{code:java}
public static Path getRootDir(final Configuration c) throws IOException {
  Path p = new Path(c.get(HConstants.HBASE_DIR));
  // p.getFileSystem(c) goes through the HDFS client and may throw IOException
  FileSystem fs = p.getFileSystem(c);
  return p.makeQualified(fs.getUri(), fs.getWorkingDirectory());
}{code}
When the IOException is thrown, it is caught in
ReplicationSourceManager#peerListChanged(), which just prints one error log.
Note that this failure is permanent: there is no recovery mechanism to retry,
so the znode state and the HBase state diverge permanently.
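To make the failure mode concrete, the handling has roughly the following
shape (an illustrative paraphrase under our reading of the code, not the
verbatim HBase 2.0.0 source): the work triggered by the peer-list change is
wrapped in a try/catch that only logs, so a single IOException from
CommonFSUtils.getRootDir() ends the attempt for good.
{code:java}
import java.io.IOException;
import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative paraphrase only, not the verbatim HBase source.
final class PeerListHandlerSketch {
  private static final Logger LOG = LoggerFactory.getLogger(PeerListHandlerSketch.class);

  void peerListChanged(List<String> peerIds) {
    for (String peerId : peerIds) {
      try {
        addSource(peerId); // ends up in HBaseInterClusterReplicationEndpoint.init()
                           // -> CommonFSUtils.getRootDir(), which may throw IOException
      } catch (IOException e) {
        // Logged once and dropped: there is no retry, so the znode state and
        // the in-memory replication sources diverge permanently.
        LOG.error("Error while adding peer " + peerId, e);
      }
    }
  }

  // Stand-in for ReplicationSourceManager.addSource(peerId).
  void addSource(String peerId) throws IOException {
  }
}{code}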
It is worth noting that the HBase-side state changes for the two regionservers
are performed concurrently and largely independently in two event threads, so
adding the peer may succeed on one regionserver while failing on the other.
Afterwards, when we run runSimplePutdeleteTest() to add one row to the first
cluster, the flaky behavior emerges. If the regionserver that successfully
added the peer takes charge of the row, the second cluster sees the change and
the test passes. However, if the regionserver that failed to add cluster 2 as
a replication peer takes charge of it, the second cluster never sees the new
row and the test fails.
We see this as problematic because it causes a permanent divergence between
the metadata in the znodes and the actual state in the HBase implementation,
and the resulting flakiness is a serious problem.
We are not sure what the best place to mitigate the issue is. One potential
fix we tried is to retry nodeChildrenChanged() with a backoff period, which
successfully fixed the problem in our testing; a sketch is shown below.
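For reference, the mitigation we tried looks roughly like the following (a
minimal sketch; the helper, constants, and the exact hook point are
illustrative rather than an actual patch): re-run the peer-refresh work with
an exponential backoff when the underlying FileSystem call fails, instead of
giving up after one error log.
{code:java}
import java.io.IOException;

final class RetryingPeerRefresh {
  private static final int MAX_ATTEMPTS = 5;
  private static final long INITIAL_BACKOFF_MS = 1000L;

  interface PeerRefresh {
    void run() throws IOException; // e.g. the body of nodeChildrenChanged()
  }

  static void runWithBackoff(PeerRefresh refresh)
      throws IOException, InterruptedException {
    long backoff = INITIAL_BACKOFF_MS;
    IOException lastFailure = null;
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      try {
        refresh.run();
        return; // succeeded: znode state and HBase state are in sync again
      } catch (IOException e) {
        lastFailure = e; // remember the transient failure
        if (attempt < MAX_ATTEMPTS) {
          Thread.sleep(backoff); // back off before the next attempt
          backoff *= 2;          // exponential backoff between attempts
        }
      }
    }
    // Still failing after several attempts: surface the error instead of
    // silently diverging, so the caller can log or escalate it.
    throw lastFailure;
  }
}{code}
Whether nodeChildrenChanged() is the right layer for such a retry, or whether
it belongs lower down (for example around
ReplicationSourceManager#peerListChanged()), is part of what we would like
feedback on.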
Any comments and suggestions would be appreciated.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)