Igniters,

While preparing some Ignite materials for administrators, I found that Ignite is not friendly to such a critical DevOps practice as monitoring.

TL;DR: I think Ignite is missing structured descriptions of abnormal events, with references to event IDs in the logs that do not change as new versions are released.

MORE DETAILS

I call an application “monitoring friendly” if it allows DevOps to:
1. Immediately receive a notification (email, SMS, etc.).
2. Understand what the problem is without involving developers.
3. Run automated recovery actions.
Large enterprises do not implement custom solutions. They usually use tools like DynaTrace, Nagios, SCOM, etc. to monitor all apps in the enterprise consistently. All such tools have a similar architecture: a dashboard showing apps as “green/yellow/red”, and numerous “connectors” that look for events in text logs, ESBs, database tables, etc.

For each app, DevOps builds a “health model” - a diagram displaying the app’s “manageable” components and the app boundaries. A “manageable” component is something that can be started/stopped/configured in isolation. The “system boundary” is the list of external apps that the monitored app interacts with.

The main attribute of a manageable component is a list of “operationally significant events”. These are the events that DevOps can act on. For example, “failed to connect to cache store” is significant, while “user input validation failed” is not. Events shall be as specific as possible so that DevOps do not spend time on further analysis. For example, a “database failure” event is not good. There should be separate “database connection failure”, “invalid database schema”, “database authentication failure”, etc. events.

An “event” is NOT the same as an exception occurring in the code. Events identify a specific problem from the DevOps point of view. For example, even if a “connection to cache store failed” exception might be thrown from several places in the code, it is still the same event. On the other hand, even if SqlServerConnectionTimeout and OracleConnectionTimeout exceptions might be caught in the same place, those are different events, since MS SQL Server and Oracle are usually handled by different DevOps groups in large enterprises!

The operationally significant event IDs must be stable: they must not change from one release to another. This is like a contract between developers and DevOps.
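To make the idea concrete, here is a minimal sketch of what such a contract could look like in code. Everything in it is hypothetical (OpsEvent, Severity, toLogLine are not an existing Ignite API); the point is only that each operationally significant event carries a stable numeric ID, a severity, a description and a recovery hint, and is logged with a fixed machine-parseable prefix that never changes between releases:

```java
// Hypothetical sketch of a stable event catalog shared by developers and
// DevOps. IDs must never be renumbered or reused in later releases, so
// monitoring patterns keep working after an upgrade.
enum OpsEvent {
    ZK_CONNECT_FAILED(10100, Severity.CRITICAL,
        "Could not connect to ZooKeeper to discover nodes",
        "Check the ZooKeeper connection string in the Ignite configuration; "
            + "make sure ZooKeeper is running"),
    NODE_LEFT_CLUSTER(10200, Severity.WARNING,
        "Ignite node left the cluster",
        "Check the node's host; restart the node if needed");

    /** Critical = system not operational, Warning = degraded, None = info. */
    enum Severity { CRITICAL, WARNING, NONE }

    final int id;
    final Severity severity;
    final String description;
    final String recovery;

    OpsEvent(int id, Severity severity, String description, String recovery) {
        this.id = id;
        this.severity = severity;
        this.description = description;
        this.recovery = recovery;
    }

    /** Log line with a stable prefix that monitoring tools can match on. */
    String toLogLine(String details) {
        return String.format("[IGNITE-EVT-%d][%s] %s: %s",
            id, severity, description, details);
    }
}
```

A monitoring connector can then trigger on the fixed prefix (e.g. a pattern matching "[IGNITE-EVT-10100]") no matter how the human-readable part of the message is reworded in a new release.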
It should be the developer’s responsibility to publish and maintain a table with the following attributes:
- Event ID
- Severity: Critical (Red) - the system is not operational; Warning (Yellow) - the system is operational but its health is degraded; None - just an info message.
- Description: concise, but enough for DevOps to act without the developer’s help.
- Recovery actions: what DevOps shall do to fix the issue without the developer’s help. DevOps might create automated recovery scripts based on this information.

For example:
10100 - Critical - Could not connect to ZooKeeper to discover nodes - 1) Open the Ignite configuration and find the ZooKeeper connection string. 2) Make sure ZooKeeper is running.
10200 - Warning - Ignite node left the cluster.

Back to Ignite: it looks to me like we do not design for operations as described above. We have no event IDs: our logging is subject to change in new versions, so any patterns DevOps might use to detect significant events would stop working after an upgrade.

If I am not the only one who has such concerns, then we might open a ticket to address this.

Best regards,
Alexey