Hi Alexey,
A nice thing about delegating alerting to 3rd party enterprise systems is that
those systems already deal with lots of concerns, including distributed apps.
What is needed from Ignite is to write to log files consistently (again, that
means stable event IDs, proper event granularity, no repetition, and
documentation). It is then the 3rd party monitoring system's responsibility to
monitor log files on all nodes and to filter, aggregate, process, visualize
and notify on events.
Here is how a monitoring tool would deal with an event like "node left". The
only thing needed from Ignite is to write an entry like the one below to log
files on all Ignite servers. In this example, 3300 identifies the "node left"
event and will never change in future releases, even if the text description
changes:
[2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH left the cluster
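On the Ignite side this could be a thin helper around the regular logger. A
minimal sketch, assuming SLF4J; the OperationalEvents class and its constant
are hypothetical and deliberately separate from the existing
org.apache.ignite.events API:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical sketch, not the actual Ignite API: a catalog of stable,
// operationally significant event IDs and a helper that writes them.
public final class OperationalEvents {
    private static final Logger LOG =
        LoggerFactory.getLogger(OperationalEvents.class);

    /** Stable ID for "node left the cluster"; must never change across releases. */
    public static final int EVT_NODE_LEFT = 3300;

    private OperationalEvents() {}

    public static void nodeLeft(String nodeId) {
        // The numeric ID is the contract with DevOps; the message text may
        // evolve freely between releases.
        LOG.warn("{} Node {} left the cluster", EVT_NODE_LEFT, nodeId);
    }
}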
Then we document somewhere on the web that Ignite has event 3300 and that it
means a node left the cluster. Maybe also provide documentation on how to deal
with it. Some examples:
Oracle Web Cache events: https://docs.oracle.com/cd/B14099_19/caching.1012/b14046/event.htm#sthref2393
MS SQL Server events: https://msdn.microsoft.com/en-us/library/cc645603(v=sql.105).aspx
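For Ignite, such a documentation entry might look like this (the recovery
guidance below is only an illustration, not existing Ignite documentation):

Event ID: 3300
Severity: Warning - the cluster is operational but capacity/redundancy is degraded
Description: A server node left the cluster
Recovery: Check the logs of the node that left for the shutdown reason;
restart the node or investigate network connectivity
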
That is all for Ignite! Everything else is handled by the specific monitoring
system configured by DevOps on the customer side.
Based on Ignite documentation similar to the above, the DevOps team of a
company where Ignite is going to be used will configure their monitoring
system to understand Ignite events. Consider the "node left" event as an
example.
- This event is output on every node, but DevOps do not want to be notified
many times. To address this, they will build an "Ignite model" with a
parent-child dependency between the components "Ignite Cluster" and "Ignite
Node". For example, this is how you do it in Nagios:
https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/4/en/dependencies.html
and this is how you do it in Microsoft SCSM:
https://docs.microsoft.com/en-us/system-center/scsm/auth-classes. Then DevOps
will configure "node left" monitors in SCSM (or "checks" in Nagios) for the
parent "Ignite Cluster" and child "Ignite Node" components. State change (OK
-> WARNING) and notification (email, SMS, whatever) will be configured only
for the "Ignite Cluster"'s "node left" monitor.
- Now suppose a node left. The "node left" monitor (which uses a log file
monitoring plugin) on "Ignite Node" will detect the event and pass it to the
parent. This will trigger an "Ignite Cluster" state change from OK to WARNING
and send a notification. No more notifications will be sent until the "Ignite
Cluster" state is reset back to OK, which happens either manually, on timeout,
or automatically on "node joined". A sketch of the detection logic follows
below.
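Whatever tool is used, the log-file check only needs to match the stable ID,
never the message text. A minimal sketch in Java of such detection logic,
assuming the log format shown earlier (the class and method names are
illustrative, not part of any real monitoring plugin):

import java.util.regex.Pattern;

// Hypothetical sketch: match the stable "node left" event ID 3300 in a log
// line, regardless of how the human-readable message evolves over releases.
public final class NodeLeftCheck {
    // Matches e.g. "[2017-09-01 10:00:14] [WARN] 3300 Node ... left the cluster"
    private static final Pattern NODE_LEFT =
        Pattern.compile("^\\[[^\\]]+\\] \\[\\w+\\] 3300\\b.*");

    private NodeLeftCheck() {}

    public static boolean matches(String logLine) {
        return NODE_LEFT.matcher(logLine).matches();
    }
}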
This was just FYI. We, Ignite developers, do not need to care about how
monitoring works - that is the responsibility of the customer's DevOps. Our
responsibility is consistent event logging.
Thank you!


Best regards, Alexey


On Tuesday, August 15, 2017, 6:16:25 PM GMT+3, Alexey Kuznetsov 
<akuznet...@apache.org> wrote:

Alexey,

How are you going to deal with the distributed nature of an Ignite cluster?
And how do you propose to handle node restarts / stops?

On Tue, Aug 15, 2017 at 9:12 PM, Alexey Kukushkin <
alexeykukush...@yahoo.com.invalid> wrote:

> Hi Denis,
> Monitoring tools simply watch event logs for patterns (regexes in the case
> of unstructured logs like text files). A stable (not changing in new
> releases) event ID identifying a specific issue would be such a pattern.
> We need to introduce such event IDs according to the principles I
> described in my previous mail.
> Best regards, Alexey
>
>
> On Tuesday, August 15, 2017, 4:53:05 AM GMT+3, Denis Magda <
> dma...@apache.org> wrote:
>
> Hello Alexey,
>
> Thanks for the detailed input.
>
> Assuming that Ignite supported the suggested event-based model, how can
> it be integrated with the mentioned tools like DynaTrace or Nagios? Is this all
> we need?
>
> —
> Denis
>
> > On Aug 14, 2017, at 5:02 AM, Alexey Kukushkin <alexeykukush...@yahoo.com
> .INVALID> wrote:
> >
> > Igniters,
> > While preparing some Ignite materials for Administrators I found that
> Ignite is not friendly to such a critical DevOps practice as monitoring.
> > TL;DR: I think Ignite misses structured descriptions of abnormal events,
> with references to stable event IDs in the logs that do not change as new
> versions are released.
> > MORE DETAILS
> > I call an application “monitoring friendly” if it allows DevOps to:
> > 1. immediately receive a notification (email, SMS, etc.)
> > 2. understand what the problem is without involving developers
> > 3. provide automated recovery actions.
> >
> > Large enterprises do not implement custom solutions. They usually use
> tools like DynaTrace, Nagios, SCOM, etc. to monitor all apps in the
> enterprise consistently. All such tools have a similar architecture,
> providing a dashboard showing apps as “green/yellow/red” and numerous
> “connectors” that look for events in text logs, ESBs, database tables, etc.
> >
> > For each app DevOps build a “health model” - a diagram displaying the
> app’s “manageable” components and the system boundary. A “manageable”
> component is something that can be started/stopped/configured in isolation.
> “System boundary” is a list of external apps that the monitored app
> interacts with.
> >
> > The main attribute of a manageable component is a list of “operationally
> significant events”. Those are the events that DevOps can act on. For
> example, “failed to connect to cache store” is significant, while
> “user input validation failed” is not.
> >
> > Events shall be as specific as possible so that DevOps do not spend time
> on further analysis. For example, a “database failure” event is not good.
> There should be “database connection failure”, “invalid database schema”,
> “database authentication failure”, etc. events.
> >
> > “Event” is NOT the same as an exception occurring in the code. Events
> identify a specific problem from the DevOps point of view. For example, even
> if a “connection to cache store failed” exception might be thrown from
> several places in the code, it is still the same event. On the other
> hand, even if SqlServerConnectionTimeout and OracleConnectionTimeout
> exceptions might be caught in the same place, those are different events,
> since MS SQL Server and Oracle are usually owned by different DevOps groups
> in large enterprises!
> >
> > The operationally significant event IDs must be stable: they must not
> change from one release to another. This is like a contract between
> developers and DevOps.
> >
> > It should be the developer’s responsibility to publish and maintain a
> table with the following attributes:
> >
> > - Event ID
> > - Severity: Critical (Red) - the system is not operational; Warning
> (Yellow) - the system is operational but health is degraded; None - just
> informational.
> > - Description: concise but with enough detail for DevOps to act without
> the developer’s help
> > - Recovery actions: what DevOps shall do to fix the issue without the
> developer’s help. DevOps might create automated recovery scripts based on
> this information.
> >
> > For example:
> > 10100 - Critical - Could not connect to ZooKeeper to discover nodes -
> 1) Open the Ignite configuration and find the ZooKeeper connection string.
> 2) Make sure ZooKeeper is running.
> > 10200 - Warning - Ignite node left the cluster.
> >
> > Back to Ignite: it looks to me like we do not design for operations as
> described above. We have no event IDs: our logging is subject to change in
> new versions, so any patterns DevOps might use to detect significant
> events would stop working after an upgrade.
> >
> > If I am not the only one who has such concerns then we might open a
> ticket to address this.
> >
> >
> > Best regards, Alexey
>



-- 
Alexey Kuznetsov
