janhoy commented on a change in pull request #1387: SOLR-14210: Include replica health in healthcheck handler
URL: https://github.com/apache/lucene-solr/pull/1387#discussion_r401424526
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/handler/admin/HealthCheckHandler.java
 ##########
 @@ -88,15 +95,42 @@ public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throw
       return;
     }
 
-    // Set status to true if this node is in live_nodes
-    if (clusterState.getLiveNodes().contains(cores.getZkController().getNodeName())) {
-      rsp.add(STATUS, OK);
-    } else {
+    // Fail if not in live_nodes
+    if 
(!clusterState.getLiveNodes().contains(cores.getZkController().getNodeName())) {
       rsp.add(STATUS, FAILURE);
       rsp.setException(new SolrException(SolrException.ErrorCode.SERVICE_UNAVAILABLE, "Host Unavailable: Not in live nodes as per zk"));
+      return;
     }
 
-    rsp.setHttpCaching(false);
+    // Optionally require that all cores on this node are active if param 'requireHealthyCores=true'
+    if (req.getParams().getBool(PARAM_REQUIRE_HEALTHY_CORES, false)) {
+      List<String> unhealthyCores = findUnhealthyCores(clusterState, cores.getNodeConfig().getNodeName());
+      if (unhealthyCores.size() > 0) {
+          rsp.add(STATUS, FAILURE);
+          rsp.setException(new SolrException(SolrException.ErrorCode.SERVICE_UNAVAILABLE,
+                  "Replica(s) " + unhealthyCores + " are currently initializing or recovering"));
+          return;
+      }
+      rsp.add("MESSAGE", "All cores are healthy");
+    }
+
+    // All lights green, report healthy
+    rsp.add(STATUS, OK);
+  }
+
+  /**
+   * Find replicas DOWN or RECOVERING
+   * @param clusterState clusterstate from ZK
+   * @param nodeName this node name
+   * @return list of core names that are either DOWN or RECOVERING on 'nodeName'
+   */
+  static List<String> findUnhealthyCores(ClusterState clusterState, String nodeName) {
+    return clusterState.getCollectionsMap().values().stream()
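
The diff is truncated right at the start of the stream pipeline, so for orientation here is a minimal sketch of how such a pipeline could be completed, restricted to ACTIVE slices as argued in the review comment below. This is an illustration only, not the PR's actual implementation; the class name `UnhealthyCoreFinder` is made up for the example.

```java
import java.util.List;
import java.util.stream.Collectors;

import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

class UnhealthyCoreFinder {
  /** Cores on 'nodeName' that sit in an ACTIVE slice but are not themselves ACTIVE. */
  static List<String> findUnhealthyCores(ClusterState clusterState, String nodeName) {
    return clusterState.getCollectionsMap().values().stream()
        .flatMap(collection -> collection.getSlices().stream())
        // ignore INACTIVE slices, e.g. parent shards left behind by a split
        .filter(slice -> slice.getState() == Slice.State.ACTIVE)
        // only replicas hosted on this node
        .flatMap(slice -> slice.getReplicas().stream())
        .filter(replica -> nodeName.equals(replica.getNodeName()))
        // anything not ACTIVE (DOWN, RECOVERING, ...) counts as unhealthy
        .filter(replica -> replica.getState() != Replica.State.ACTIVE)
        .map(Replica::getCoreName)
        .collect(Collectors.toList());
  }
}
```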
 
 Review comment:
   > maybe we want to return false if there are any replicas from inactive slices on the node
   
   Inactive shards are not searched, so we should not care about those. A shard split does not clean up the old shard; it marks it INACTIVE until the user deletes it manually or the Autoscaling framework rules reap it. That is why I chose to check active shards only. As long as the active shard(s) are up and their replicas active, we should be fine, and k8s can go ahead and restart the next node.
   
   If a shard split is currently running on a node being restarted (splits can be long-running), the split would be aborted, but when the node comes back up I believe the Overseer might retry it?
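
Since the comment brings up k8s restarting nodes one by one, here is a rough SolrJ sketch of how an external readiness probe might call the handler once this flag lands. The endpoint path `/admin/info/health`, the request parameter name `requireHealthyCores` (inferred from the `PARAM_REQUIRE_HEALTHY_CORES` constant in the diff), and the base URL are assumptions, not taken from the PR.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;

public class NodeHealthProbe {
  public static void main(String[] args) throws Exception {
    // Assumed base URL of the node being probed.
    try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      ModifiableSolrParams params = new ModifiableSolrParams();
      // Assumed request parameter backing PARAM_REQUIRE_HEALTHY_CORES.
      params.set("requireHealthyCores", "true");
      GenericSolrRequest req =
          new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/info/health", params);
      // On failure the handler sets a 503 plus an exception, which SolrJ
      // surfaces as a thrown exception rather than a normal response.
      NamedList<Object> rsp = client.request(req);
      System.out.println("status: " + rsp.get("status"));
    }
  }
}
```

A probe like this would treat a thrown exception (or any non-200 response) as "not ready", which is the behavior the `requireHealthyCores` flag is meant to enable for rolling restarts.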
