Igniters,
While preparing some Ignite materials for administrators, I found that Ignite 
is not friendly to such a critical DevOps practice as monitoring.
TL;DR: I think Ignite lacks structured descriptions of abnormal events, with 
references to event IDs in the logs that do not change as new versions are 
released.
MORE DETAILS
I call an application “monitoring friendly” if it allows DevOps to:
1. immediately receive a notification (email, SMS, etc.)
2. understand what the problem is without involving developers
3. provide automated recovery actions.

Large enterprises do not implement custom solutions. They usually use tools 
like DynaTrace, Nagios, SCOM, etc. to monitor all apps in the enterprise 
consistently. All such tools have a similar architecture: a dashboard 
showing apps as “green/yellow/red”, and numerous “connectors” that look for 
events in text logs, ESBs, database tables, etc.

For each app, DevOps builds a “health model” - a diagram displaying the app’s 
“manageable” components and the app’s boundaries. A “manageable” component is 
something that can be started/stopped/configured in isolation. The “system 
boundary” is the list of external apps that the monitored app interacts with.

The main attribute of a manageable component is its list of “operationally 
significant events” - the events that DevOps can act on. For example, 
“failed to connect to cache store” is significant, while “user input 
validation failed” is not.

Events should be as specific as possible so that DevOps do not spend time on 
further analysis. For example, a generic “database failure” event is not good 
enough. There should be separate “database connection failure”, “invalid 
database schema”, “database authentication failure”, etc. events.

An “event” is NOT the same as an exception occurring in the code. Events 
identify a specific problem from the DevOps point of view. For example, even 
if a “connection to cache store failed” exception is thrown from several 
places in the code, it is still the same event. On the other hand, even if 
SqlServerConnectionTimeout and OracleConnectionTimeout exceptions are caught 
in the same place, they are different events, since MS SQL Server and Oracle 
are usually handled by different DevOps groups in large enterprises!
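
To make this concrete, here is a rough sketch (the exception classes, event 
IDs, and raise() helper are invented for the example, not existing Ignite 
code) of one catch site producing two different events:

    public class EventMappingSketch {

        // Hypothetical vendor-specific exceptions.
        static class SqlServerConnectionTimeout extends RuntimeException {}
        static class OracleConnectionTimeout extends RuntimeException {}

        // Stable, documented event IDs (values made up for the example).
        static final int EVT_MSSQL_CONN_TIMEOUT  = 10110;
        static final int EVT_ORACLE_CONN_TIMEOUT = 10120;

        static void connect(Runnable attempt) {
            try {
                attempt.run();
            } catch (SqlServerConnectionTimeout e) {
                raise(EVT_MSSQL_CONN_TIMEOUT, e);   // routed to the MSSQL group
            } catch (OracleConnectionTimeout e) {
                raise(EVT_ORACLE_CONN_TIMEOUT, e);  // routed to the Oracle group
            }
        }

        static void raise(int eventId, Exception cause) {
            // In a real system this would go to the monitoring connector.
            System.err.printf("IGNITE-EVT-%d: %s%n", eventId, cause);
        }
    }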

The operationally significant event IDs must be stable: they must not change 
from one release to another. This is like a contract between developers and 
DevOps.

It should be the developers’ responsibility to publish and maintain a table 
with the following attributes:
 
- Event ID
- Severity: Critical (Red) - the system is not operational; Warning (Yellow) - 
the system is operational but its health is degraded; None - informational 
only.
- Description: concise, but enough for DevOps to act without the developers’ 
help
- Recovery actions: what DevOps should do to fix the issue without the 
developers’ help. DevOps might create automated recovery scripts based on 
this information.

For example:
- 10100 - Critical - Could not connect to ZooKeeper to discover nodes - 
1) Open the Ignite configuration and find the ZooKeeper connection string. 
2) Make sure ZooKeeper is running.
- 10200 - Warning - Ignite node left the cluster.
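
Developers could pin such a table in a single catalog that ships with each 
release. A rough sketch (names, IDs, and texts are invented for illustration; 
this is not an existing Ignite API):

    // One place that fixes IDs, severity, description, and recovery
    // actions, published and diffed with every release.
    public enum OperationalEvent {
        ZK_CONNECT_FAILED(10100, Severity.CRITICAL,
            "Could not connect to ZooKeeper to discover nodes",
            "Check the ZooKeeper connection string in the Ignite "
            + "configuration; make sure ZooKeeper is running"),
        NODE_LEFT(10200, Severity.WARNING,
            "Ignite node left the cluster",
            "Check the node's host and network; restart the node if needed");

        public enum Severity { CRITICAL, WARNING, NONE }

        public final int id;              // stable across releases
        public final Severity severity;
        public final String description;  // enough for DevOps to act alone
        public final String recovery;     // basis for automated recovery scripts

        OperationalEvent(int id, Severity severity, String description,
                         String recovery) {
            this.id = id;
            this.severity = severity;
            this.description = description;
            this.recovery = recovery;
        }
    }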

Back to Ignite: it looks to me like we do not design for operations as 
described above. We have no event IDs: our logging is subject to change in 
new versions, so any patterns DevOps might use to detect significant events 
would stop working after an upgrade.
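
For comparison, a log line carrying a stable event ID could look like the 
sketch below, so a fixed pattern such as "IGNITE-EVT-10100" keeps matching 
across upgrades even if the wording changes (the prefix and class are made 
up; assumes SLF4J):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class DiscoverySketch {
        private static final Logger log =
            LoggerFactory.getLogger(DiscoverySketch.class);

        // The ID stays fixed even if the human-readable text is reworded.
        void onZkConnectFailure(Exception cause) {
            log.error("IGNITE-EVT-10100: Could not connect to ZooKeeper "
                + "to discover nodes", cause);
        }
    }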

If I am not the only one who has such concerns, then we might open a ticket 
to address this.


Best regards, Alexey
