Hello Alexey,

Thanks for the detailed input.
Assuming that Ignite supported the suggested event-based model, how could it be integrated with the tools mentioned, like DynaTrace or Nagios? Is this all we need?

— Denis

> On Aug 14, 2017, at 5:02 AM, Alexey Kukushkin <alexeykukush...@yahoo.com.INVALID> wrote:
>
> Igniters,
>
> While preparing some Ignite materials for administrators I found that Ignite is not friendly to such a critical DevOps practice as monitoring.
>
> TL;DR: I think Ignite is missing structured descriptions of abnormal events, with references to event IDs in the logs that do not change as new versions are released.
>
> MORE DETAILS
>
> I call an application “monitoring friendly” if it allows DevOps to:
> 1. immediately receive a notification (email, SMS, etc.);
> 2. understand what the problem is without involving developers;
> 3. provide an automated recovery action.
>
> Large enterprises do not implement custom solutions. They usually use tools like DynaTrace, Nagios, SCOM, etc. to monitor all apps in the enterprise consistently. All such tools have a similar architecture: a dashboard showing apps as “green/yellow/red”, and numerous “connectors” that look for events in text logs, ESBs, database tables, etc.
>
> For each app, DevOps build a “health model” - a diagram displaying the app’s “manageable” components and the app’s boundaries. A “manageable” component is something that can be started/stopped/configured in isolation. The “system boundary” is the list of external apps that the monitored app interacts with.
>
> The main attribute of a manageable component is its list of “operationally significant events”. Those are the events that DevOps can do something about. For example, “failed to connect to cache store” is significant, while “user input validation failed” is not.
>
> Events shall be as specific as possible so that DevOps do not spend time on further analysis. For example, a “database failure” event is not good. There should be “database connection failure”, “invalid database schema”, “database authentication failure”, etc. events.
>
> An “event” is NOT the same as an exception occurring in the code. Events identify a specific problem from the DevOps point of view. For example, even if a “connection to cache store failed” exception is thrown from several places in the code, it is still the same event. On the other hand, even if SqlServerConnectionTimeout and OracleConnectionTimeout exceptions are caught in the same place, those are different events, since MS SQL Server and Oracle are usually handled by different DevOps groups in large enterprises (see the mapping sketch below)!
>
> The operationally significant event IDs must be stable: they must not change from one release to another. This is like a contract between developers and DevOps.
>
> It should be the developers’ responsibility to publish and maintain a table with the following attributes (see the catalog sketch below):
>
> - Event ID
> - Severity: Critical (Red) - the system is not operational; Warning (Yellow) - the system is operational but its health is degraded; None - just an info message.
> - Description: concise but sufficient for DevOps to act without a developer’s help.
> - Recovery actions: what DevOps shall do to fix the issue without a developer’s help. DevOps might create automated recovery scripts based on this information.
>
> For example:
>
> 10100 - Critical - Could not connect to ZooKeeper to discover nodes - 1) Open the Ignite configuration and find the ZooKeeper connection string. 2) Make sure ZooKeeper is running.
> 10200 - Warning - Ignite node left the cluster.
>
> Back to Ignite: it looks to me that we do not design for operations as described above. We have no event IDs: our logging is subject to change in new versions, so any patterns DevOps might use to detect significant events would stop working after an upgrade (see the pattern-matching sketch below).
>
> If I am not the only one who has such concerns, then we might open a ticket to address this.
>
> Best regards,
> Alexey
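To make the event catalog idea concrete, here is a minimal sketch of what a stable-ID catalog could look like in Java. Everything here is hypothetical - OperationalEvent, the "IGNITE-EVT-" log prefix, and the recovery text for 10200 are illustrations, not an existing Ignite API; the two entries and the 10100 recovery steps come from Alexey's examples.

    // A minimal sketch of a stable "operationally significant event" catalog.
    // All names are hypothetical; the two entries come from the examples above.
    public enum OperationalEvent {
        ZOOKEEPER_CONNECTION_FAILED(
            10100, Severity.CRITICAL,
            "Could not connect to ZooKeeper to discover nodes",
            "Check the ZooKeeper connection string in the Ignite configuration "
                + "and make sure ZooKeeper is running."),

        NODE_LEFT_CLUSTER(
            10200, Severity.WARNING,
            "Ignite node left the cluster",
            "Check the node's log and restart the node if it did not leave intentionally.");

        public enum Severity { CRITICAL, WARNING, NONE }

        private final int id;
        private final Severity severity;
        private final String description;
        private final String recoveryAction;

        OperationalEvent(int id, Severity severity, String description, String recoveryAction) {
            this.id = id;
            this.severity = severity;
            this.description = description;
            this.recoveryAction = recoveryAction;
        }

        /** Renders a log line carrying a stable, machine-matchable event ID. */
        public String toLogMessage(String details) {
            return String.format("[IGNITE-EVT-%d] [%s] %s: %s", id, severity, description, details);
        }
    }

A developer would then log OperationalEvent.ZOOKEEPER_CONNECTION_FAILED.toLogMessage(cause) instead of a free-form message, and the published ID/severity/description/recovery table could be generated from the catalog itself so it never drifts from the code.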
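Alexey's point that an event is not the same as an exception can be sketched too. In this hypothetical fragment the two vendor timeout exceptions are stubs standing in for whatever the cache store actually throws, and the IDs 10300, 10301 and 10399 are invented for illustration:

    // Hypothetical stubs standing in for vendor-specific cache store failures.
    class SqlServerConnectionTimeout extends Exception { }
    class OracleConnectionTimeout extends Exception { }

    public class EventMappingExample {
        // Simulates a cache store call failing with a vendor-specific timeout.
        static void connectToStore() throws Exception {
            throw new OracleConnectionTimeout();
        }

        public static void main(String[] args) {
            try {
                connectToStore();
            }
            // One catch site, two different operational events: MS SQL Server and
            // Oracle are usually owned by different DevOps groups, so monitoring
            // must be able to tell them apart.
            catch (SqlServerConnectionTimeout e) {
                System.out.println("[IGNITE-EVT-10300] MS SQL Server connection timeout");
            }
            catch (OracleConnectionTimeout e) {
                System.out.println("[IGNITE-EVT-10301] Oracle connection timeout");
            }
            catch (Exception e) {
                System.out.println("[IGNITE-EVT-10399] Unclassified cache store failure");
            }
        }
    }

The converse also holds: a "connection to cache store failed" exception thrown from several places in the code would map to one and the same event ID, because DevOps react to it the same way wherever it originates.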
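Regarding Denis's question about integration: log connectors in tools like Nagios or DynaTrace watch log files for fixed patterns. If every operationally significant line carries a stable ID, a single pattern survives upgrades no matter how the human-readable text changes. A sketch, assuming the hypothetical [IGNITE-EVT-nnnn] prefix from the catalog above:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // What a log-scraping connector effectively does: match one stable pattern
    // instead of per-release message text.
    public class EventIdMatcher {
        private static final Pattern EVENT_ID = Pattern.compile("\\[IGNITE-EVT-(\\d+)\\]");

        public static void main(String[] args) {
            String line = "[IGNITE-EVT-10100] [CRITICAL] Could not connect to ZooKeeper to discover nodes: ...";
            Matcher m = EVENT_ID.matcher(line);
            if (m.find())
                System.out.println("Raise a red alert for event " + m.group(1)); // prints 10100
        }
    }

The same stable IDs could also be surfaced through Ignite's existing events and JMX metrics, which is how agent-based monitors typically attach; either way, the stable ID contract is the prerequisite.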