virajjasani commented on code in PR #5774:
URL: https://github.com/apache/hbase/pull/5774#discussion_r1536979403
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java:
##########
@@ -324,8 +324,19 @@ public void regionServerReport(ServerName sn, ServerMetrics sl) throws YouAreDea
// the ServerName to use. Here we presume a master has already done
// that so we'll press on with whatever it gave us for ServerName.
if (!checkAndRecordNewServer(sn, sl)) {
- LOG.info("RegionServerReport ignored, could not record the server: " + sn);
- return; // Not recorded, so no need to move on
+ // Master already registered server with same (host + port) and higher startcode.
Review Comment:
When it happened (as per the logs mentioned on the Jira), master processed the
report, and that generated inconsistencies.
We have seen this happen many times in the past when a regionserver is not
really aborted but loses its connection to ZooKeeper, triggering SCP by master.
The regionserver with the new startcode is not only alive but has also
reported to master. After that, master somehow still receives a regionserver
report from the regionserver with the old startcode and processes it, which
results in inconsistencies. I know this is a rare case, but it has definitely
happened more than once, in more than one prod cluster.
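
To make the scenario concrete, here is a minimal, self-contained sketch of the check being discussed: ignore a report when a server with the same host + port and a higher startcode is already known. This is not the actual `ServerManager` code; the class `StaleReportFilter`, the method `shouldProcess`, and the `host:port` string key are hypothetical names used only for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch only; not the real ServerManager logic. */
class StaleReportFilter {
  // Highest startcode seen so far for each "host:port" key (assumed key format).
  private final Map<String, Long> latestStartCode = new ConcurrentHashMap<>();

  /**
   * Returns true if a report from (hostAndPort, startCode) should be processed,
   * false if a newer instance (higher startcode) on the same host + port is known.
   */
  boolean shouldProcess(String hostAndPort, long startCode) {
    // merge() atomically keeps the larger of the stored and the reported startcode.
    long latest = latestStartCode.merge(hostAndPort, startCode, Long::max);
    // A report is stale when its startcode is lower than the latest one recorded.
    return startCode >= latest;
  }
}
```

With this sketch, a report from startcode 100 is accepted until startcode 200 reports for the same host + port; any later report carrying startcode 100 is then ignored instead of being processed.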