apurtell commented on code in PR #5513:
URL: https://github.com/apache/hbase/pull/5513#discussion_r1388603796
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java:
##########
@@ -1399,23 +1399,46 @@ private void
createMissingCFsInMetaDuringUpgrade(TableDescriptor metaDescriptor)
* Check hbase:meta is up and ready for reading. For use during Master
startup only.
* @return True if meta is UP and online and startup can progress.
Otherwise, meta is not online
* and we will hold here until operator intervention.
+ * @throws IOException If the master restart is required.
*/
@InterfaceAudience.Private
- public boolean waitForMetaOnline() {
+ public boolean waitForMetaOnline() throws IOException {
return isRegionOnline(RegionInfoBuilder.FIRST_META_REGIONINFO);
}
/**
+ * Wait until the region is reported online on a live regionserver.
+ * @param ri Region info.
* @return True if region is online and scannable else false if an error or
shutdown (Otherwise we
* just block in here holding up all forward-progess).
+ * @throws IOException If the master restart is required.
*/
- private boolean isRegionOnline(RegionInfo ri) {
+ private boolean isRegionOnline(RegionInfo ri) throws IOException {
RetryCounter rc = null;
while (!isStopped()) {
RegionState rs =
this.assignmentManager.getRegionStates().getRegionState(ri);
if (rs != null && rs.isOpened()) {
if (this.getServerManager().isServerOnline(rs.getServerName())) {
return true;
+ } else {
+ LOG.info("{} has state {} but the server {} is not online,
scheduling recovery.",
+ ri.getRegionNameAsString(), rs, rs.getServerName());
+ this.getServerManager().expireServer(rs.getServerName(), true);
+ // If already many SCPs are scheduled, but they are not progressing
because of
+ // meta's unavailability, the best action item is to throw
PleaseRestartMasterException
+ // and let new active master init take care of on-lining meta and
process all other
+ // pending SCPs. It's worth waiting for ~20s before arriving at the
conclusion, rather
Review Comment:
Is this true? Have we really thought about prioritizing execution of the
procedures related to meta and namespace before the others? I agree this could
lead to complex code.
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java:
##########
@@ -1399,23 +1399,46 @@ private void
createMissingCFsInMetaDuringUpgrade(TableDescriptor metaDescriptor)
* Check hbase:meta is up and ready for reading. For use during Master
startup only.
* @return True if meta is UP and online and startup can progress.
Otherwise, meta is not online
* and we will hold here until operator intervention.
+ * @throws IOException If the master restart is required.
*/
@InterfaceAudience.Private
- public boolean waitForMetaOnline() {
+ public boolean waitForMetaOnline() throws IOException {
return isRegionOnline(RegionInfoBuilder.FIRST_META_REGIONINFO);
}
/**
+ * Wait until the region is reported online on a live regionserver.
+ * @param ri Region info.
* @return True if region is online and scannable else false if an error or
shutdown (Otherwise we
* just block in here holding up all forward-progess).
+ * @throws IOException If the master restart is required.
*/
- private boolean isRegionOnline(RegionInfo ri) {
+ private boolean isRegionOnline(RegionInfo ri) throws IOException {
RetryCounter rc = null;
while (!isStopped()) {
RegionState rs =
this.assignmentManager.getRegionStates().getRegionState(ri);
if (rs != null && rs.isOpened()) {
if (this.getServerManager().isServerOnline(rs.getServerName())) {
return true;
+ } else {
+ LOG.info("{} has state {} but the server {} is not online,
scheduling recovery.",
+ ri.getRegionNameAsString(), rs, rs.getServerName());
+ this.getServerManager().expireServer(rs.getServerName(), true);
+ // If already many SCPs are scheduled, but they are not progressing
because of
+ // meta's unavailability, the best action item is to throw
PleaseRestartMasterException
+ // and let new active master init take care of on-lining meta and
process all other
+ // pending SCPs. It's worth waiting for ~20s before arriving at the
conclusion, rather
+ // than looping through procedures to figure out how/when/why they
are able to or not
+ // able to make any progress and eventually abort master
initialization anyway.
+ Threads.sleep(20000);
Review Comment:
Interrupt is not handled here.
But, more generally, do not unconditionally wait for so long. Use the Waiter
pattern. Why 20 seconds and not some other value? Why not make this
configurable?
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java:
##########
@@ -1399,23 +1399,46 @@ private void
createMissingCFsInMetaDuringUpgrade(TableDescriptor metaDescriptor)
* Check hbase:meta is up and ready for reading. For use during Master
startup only.
* @return True if meta is UP and online and startup can progress.
Otherwise, meta is not online
* and we will hold here until operator intervention.
+ * @throws IOException If the master restart is required.
*/
@InterfaceAudience.Private
- public boolean waitForMetaOnline() {
+ public boolean waitForMetaOnline() throws IOException {
return isRegionOnline(RegionInfoBuilder.FIRST_META_REGIONINFO);
}
/**
+ * Wait until the region is reported online on a live regionserver.
+ * @param ri Region info.
* @return True if region is online and scannable else false if an error or
shutdown (Otherwise we
* just block in here holding up all forward-progess).
+ * @throws IOException If the master restart is required.
*/
- private boolean isRegionOnline(RegionInfo ri) {
+ private boolean isRegionOnline(RegionInfo ri) throws IOException {
RetryCounter rc = null;
while (!isStopped()) {
RegionState rs =
this.assignmentManager.getRegionStates().getRegionState(ri);
if (rs != null && rs.isOpened()) {
if (this.getServerManager().isServerOnline(rs.getServerName())) {
return true;
+ } else {
+ LOG.info("{} has state {} but the server {} is not online,
scheduling recovery.",
+ ri.getRegionNameAsString(), rs, rs.getServerName());
+ this.getServerManager().expireServer(rs.getServerName(), true);
+ // If already many SCPs are scheduled, but they are not progressing
because of
+ // meta's unavailability, the best action item is to throw
PleaseRestartMasterException
+ // and let new active master init take care of on-lining meta and
process all other
+ // pending SCPs. It's worth waiting for ~20s before arriving at the
conclusion, rather
+ // than looping through procedures to figure out how/when/why they
are able to or not
+ // able to make any progress and eventually abort master
initialization anyway.
+ Threads.sleep(20000);
+ rs = this.assignmentManager.getRegionStates().getRegionState(ri);
+ if (rs != null && rs.isOpened()) {
+ if (this.getServerManager().isServerOnline(rs.getServerName())) {
+ return true;
+ } else {
+ throw new PleaseRestartMasterException("meta is still not online
on live server yet");
Review Comment:
What you check here, for this particular region, is if it is supposed to be
online on a server, that the server is considered online. But if that one
server is not online, that could be for many reasons unrelated to the state of
the meta table.
Also, this is the same logic as if we would just loop around again, so why
not do that? D.R.Y.
If you want to break out of this loop because meta is not available you
should make an explicit check of the region state for the meta regions.
Finally, do we really need to restart the master here or is there some other
way to address this?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]