umamaheswararao commented on code in PR #3649:
URL: https://github.com/apache/ozone/pull/3649#discussion_r936950996
##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/OverReplicatedProcessor.java:
##########
@@ -73,7 +73,7 @@ public void processAll() {
try {
processContainer(overRep);
processed++;
- } catch (IOException e) {
+ } catch (Exception e) {
Review Comment:
I just looked at the RM (RedundancyMonitor) thread in HDFS, and there we catch `Throwable`. I don't remember exactly how that evolved, but I recall at least one instance where the RM thread died while the NN kept running (in a bad state, of course).
We need to make sure the process goes down after such an event.
```java
@Override
public void run() {
  while (namesystem.isRunning()) {
    try {
      // Process recovery work only when active NN is out of safe mode.
      if (isPopulatingReplQueues()) {
        computeDatanodeWork();
        processPendingReconstructions();
        rescanPostponedMisreplicatedBlocks();
      }
      TimeUnit.MILLISECONDS.sleep(redundancyRecheckIntervalMs);
    } catch (Throwable t) {
      if (!namesystem.isRunning()) {
        LOG.info("Stopping RedundancyMonitor.");
        if (!(t instanceof InterruptedException)) {
          LOG.info("RedundancyMonitor received an exception"
              + " while shutting down.", t);
        }
        break;
      } else if (!checkNSRunning && t instanceof InterruptedException) {
        LOG.info("Stopping RedundancyMonitor for testing.");
        break;
      }
      LOG.error("RedundancyMonitor thread received Runtime exception. ",
          t);
      terminate(1, t);
    }
  }
}
```
I am not saying we should copy this, but this pattern survived and evolved through many such issues in a similar use case in the system.
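A minimal, self-contained sketch of the same fail-fast idea, assuming a simplified processor rather than the real `OverReplicatedProcessor`: per-container failures are caught as `Exception` and logged so one bad container does not abort the pass, while anything that escapes the loop (for example an `Error`) while the service is still running is handed to a terminator. `FailFastProcessor`, its constructor parameters, and the injected `terminator` hook are all hypothetical; in a real service the hook would presumably be something like Hadoop's `ExitUtil.terminate(1, t)`.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch, not Ozone code: per-item failures are isolated,
// but anything escaping the loop while still running is treated as fatal.
class FailFastProcessor implements Runnable {
  private static final Logger LOG =
      Logger.getLogger(FailFastProcessor.class.getName());

  private final List<String> containers;        // stand-in for the work queue
  private final Consumer<String> processor;     // stand-in for processContainer()
  private final AtomicBoolean running;          // stand-in for "service is running"
  private final Consumer<Throwable> terminator; // hypothetical; e.g. t -> ExitUtil.terminate(1, t)
  private final long intervalMs;
  final AtomicInteger processed = new AtomicInteger();
  final AtomicInteger failed = new AtomicInteger();

  FailFastProcessor(List<String> containers, Consumer<String> processor,
      AtomicBoolean running, Consumer<Throwable> terminator, long intervalMs) {
    this.containers = containers;
    this.processor = processor;
    this.running = running;
    this.terminator = terminator;
    this.intervalMs = intervalMs;
  }

  // One pass: a recoverable failure on one container must not stop the rest.
  // Errors are deliberately NOT caught here, so they reach run().
  void processAll() {
    for (String id : containers) {
      try {
        processor.accept(id);
        processed.incrementAndGet();
      } catch (Exception e) {
        failed.incrementAndGet();
        LOG.log(Level.WARNING, "Failed to process container " + id, e);
      }
    }
  }

  @Override
  public void run() {
    try {
      while (running.get()) {
        processAll();
        Thread.sleep(intervalMs);
      }
    } catch (Throwable t) {
      if (!running.get()) {
        // Shutting down: exit the thread without terminating the process.
        if (t instanceof InterruptedException) {
          Thread.currentThread().interrupt();
        } else {
          LOG.log(Level.INFO, "Exception while shutting down", t);
        }
        return;
      }
      // Service still up but this thread is dying: fail fast rather than
      // keep running without background processing. The thread exits after.
      LOG.log(Level.SEVERE, "Processor thread died unexpectedly", t);
      terminator.accept(t);
    }
  }
}
```

Injecting the terminator keeps the shutdown decision testable; the essential point is that the background thread never dies silently while the rest of the service keeps running.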
--
This is an automated message from the Apache Git Service.