zstan commented on code in PR #13130:
URL: https://github.com/apache/ignite/pull/13130#discussion_r3239303641


##########
docs/_docs/perf-and-troubleshooting/general-perf-tips.adoc:
##########
@@ -47,3 +47,20 @@ queries with JOINs at massive scale and expect significant performance benefits.
 * Adjust link:data-rebalancing[data rebalancing settings] to ensure that rebalancing completes faster when your cluster topology changes.
+== What healthy cluster behavior looks like
+
+A healthy Ignite cluster is not defined by a single latency, CPU, or memory number. In practice, it is a cluster whose topology is stable, whose cluster state and baseline match the intended deployment, whose partitions are not lost or divergent, whose rebalancing and checkpointing complete in bounded time, and whose execution queues and memory pools return to a steady level after short-lived spikes. Ignite exposes these signals through built-in metrics, system views, and the control script rather than through a single aggregate health score.
+
+When checking whether a cluster is healthy, start with topology and cluster state. The cluster should be in the expected state, usually ACTIVE, and the number of server and client nodes should be stable. If native persistence is enabled, the baseline should also be in the expected shape: for a stable deployment, the nodes that are expected to be online should appear online both in baseline-related metrics and in the SYS.BASELINE_NODES system view. Frequent unexpected topology changes are not normal and should be treated as a sign of node instability or network problems.

Review Comment:
   usually ACTIVE - need cross link to ACTIVE



--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
