Denis, IGNITE-5620 is a completely different thing. Let's not mix cluster monitoring and parser errors.

Wed, 16 Aug 2017 at 2:57, Denis Magda <dma...@apache.org>:

> Alexey,
>
> I didn't know that such an improvement as consistent IDs for errors and
> events could be used as an integration point with DevOps tools. Thanks
> for sharing your experience with us.
>
> Would you step in as an architect for this task and file a JIRA ticket
> with all the required information?
>
> In general, we've already planned to do something around this, starting
> with SQL: https://issues.apache.org/jira/browse/IGNITE-5620
>
> It makes sense to consider your input before the work on IGNITE-5620 is
> started.
>
> —
> Denis
>
> > On Aug 15, 2017, at 10:56 AM, Alexey Kukushkin
> > <alexeykukush...@yahoo.com.INVALID> wrote:
> >
> > Hi Alexey,
> >
> > A nice thing about delegating alerting to 3rd-party enterprise systems
> > is that those systems already deal with lots of things, including
> > distributed apps.
> >
> > What is needed from Ignite is to consistently write to log files
> > (again, that means stable event IDs, proper event granularity, no
> > repetition, documentation). It would then be the 3rd-party monitoring
> > system's responsibility to monitor log files on all nodes and to
> > filter, aggregate, process, visualize and notify on events.
> >
> > How would a monitoring tool deal with an event like "node left"?
> >
> > The only thing needed from Ignite is to write an entry like the one
> > below to the log files on all Ignite servers. In this example, 3300
> > identifies the "node left" event and will never change in the future,
> > even if the text description changes:
> >
> > [2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH left the cluster
> >
> > Then we document somewhere on the web that Ignite has event 3300 and
> > that it means a node left the cluster, and maybe provide documentation
> > on how to deal with it. Some examples:
> >
> > Oracle Web Cache events:
> > https://docs.oracle.com/cd/B14099_19/caching.1012/b14046/event.htm#sthref2393
> > MS SQL Server events:
> > https://msdn.microsoft.com/en-us/library/cc645603(v=sql.105).aspx
> >
> > That is all for Ignite! Everything else is handled by the specific
> > monitoring system configured by DevOps on the customer side.
> >
> > Based on Ignite documentation similar to the above, the DevOps of a
> > company where Ignite is going to be used will configure their
> > monitoring system to understand Ignite events. Consider the "node
> > left" event as an example.
> >
> > - This event is output on every node, but DevOps do not want to be
> > notified many times. To address this, they will build an "Ignite
> > model" with a parent-child dependency between the components "Ignite
> > Cluster" and "Ignite Node". For example, this is how you do it in
> > Nagios:
> > https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/4/en/dependencies.html
> > and this is how you do it in Microsoft SCSM:
> > https://docs.microsoft.com/en-us/system-center/scsm/auth-classes.
> > Then DevOps will configure "node left" monitors in SCSM (or "checks"
> > in Nagios) for the parent "Ignite Cluster" and child "Ignite Node"
> > components. The state change (OK -> WARNING) and notification (email,
> > SMS, whatever) will be configured only for the "Ignite Cluster"'s
> > "node left" monitor.
> > - Now suppose a node left. The "node left" monitor (which uses a log
> > file monitoring plugin) on "Ignite Node" will detect the event and
> > pass it to the parent. This will trigger an "Ignite Cluster" state
> > change from OK to WARNING and send a notification. No more
> > notifications will be sent unless the "Ignite Cluster" state is reset
> > back to OK, which happens either manually, on timeout, or
> > automatically on "node joined".
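
For illustration, here is a minimal Java sketch of the contract described
above: the producing side writes an entry with a stable numeric ID, and the
monitoring side keys its pattern on that ID rather than on the wording.
Everything here is hypothetical - EVT_NODE_LEFT, the message layout and the
use of SLF4J are assumptions made for the sketch, not an existing Ignite API.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class NodeLeftEventExample {
        /** Stable, documented event ID: must never change between releases. */
        private static final int EVT_NODE_LEFT = 3300;

        private static final Logger log =
            LoggerFactory.getLogger(NodeLeftEventExample.class);

        /**
         * Ignite side: writes an entry such as
         * [2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH left the cluster
         * (the timestamp and level are added by the logger configuration).
         */
        void onNodeLeft(String nodeId) {
            log.warn("{} Node {} left the cluster", EVT_NODE_LEFT, nodeId);
        }

        /** Monitoring side: the stable ID, not the wording, is what a
         *  log-watching plugin would match on. */
        static final Pattern NODE_LEFT =
            Pattern.compile("\\[WARN]\\s+3300\\s+Node\\s+(\\S+)\\s+left");

        static boolean isNodeLeft(String logLine) {
            Matcher m = NODE_LEFT.matcher(logLine);
            return m.find(); // m.group(1) would yield the node ID
        }
    }

The description text may be reworded in a future release; the numeric ID may
not. That asymmetry is the whole contract.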
> >
> > This was just FYI. We, Ignite developers, do not care about how
> > monitoring works; that is the responsibility of the customer's DevOps.
> > Our responsibility is consistent event logging.
> >
> > Thank you!
> >
> > Best regards, Alexey
> >
> > On Tuesday, August 15, 2017, 6:16:25 PM GMT+3, Alexey Kuznetsov
> > <akuznet...@apache.org> wrote:
> >
> > Alexey,
> >
> > How are you going to deal with the distributed nature of an Ignite
> > cluster? And how do you propose to handle node restarts/stops?
> >
> > On Tue, Aug 15, 2017 at 9:12 PM, Alexey Kukushkin
> > <alexeykukush...@yahoo.com.invalid> wrote:
> >
> >> Hi Denis,
> >>
> >> Monitoring tools simply watch event logs for patterns (regexes in the
> >> case of unstructured logs like text files). A stable event ID (one
> >> that does not change in new releases) identifying a specific issue
> >> would be such a pattern. We need to introduce such event IDs according
> >> to the principles I described in my previous mail.
> >>
> >> Best regards, Alexey
> >>
> >> On Tuesday, August 15, 2017, 4:53:05 AM GMT+3, Denis Magda
> >> <dma...@apache.org> wrote:
> >>
> >> Hello Alexey,
> >>
> >> Thanks for the detailed input.
> >>
> >> Assuming that Ignite supported the suggested event-based model, how
> >> could it be integrated with the tools you mentioned, like DynaTrace or
> >> Nagios? Is this all we need?
> >>
> >> —
> >> Denis
> >>
> >>> On Aug 14, 2017, at 5:02 AM, Alexey Kukushkin
> >>> <alexeykukush...@yahoo.com.INVALID> wrote:
> >>>
> >>> Igniters,
> >>>
> >>> While preparing some Ignite materials for administrators I found that
> >>> Ignite is not friendly to such a critical DevOps practice as
> >>> monitoring.
> >>>
> >>> TL;DR: I think Ignite lacks structured descriptions of abnormal
> >>> events, with references to event IDs in the logs that do not change
> >>> as new versions are released.
> >>>
> >>> MORE DETAILS
> >>>
> >>> I call an application "monitoring friendly" if it allows DevOps to:
> >>> 1. immediately receive a notification (email, SMS, etc.)
> >>> 2. understand what the problem is without involving developers
> >>> 3. provide an automated recovery action.
> >>>
> >>> Large enterprises do not implement custom solutions. They usually use
> >>> tools like DynaTrace, Nagios, SCOM, etc. to monitor all apps in the
> >>> enterprise consistently. All such tools have a similar architecture:
> >>> a dashboard showing apps as "green/yellow/red", plus numerous
> >>> "connectors" to look for events in text logs, ESBs, database tables,
> >>> etc.
> >>>
> >>> For each app, DevOps build a "health model": a diagram displaying the
> >>> app's "manageable" components and the app boundaries. A "manageable"
> >>> component is something that can be started/stopped/configured in
> >>> isolation. The "system boundary" is a list of external apps that the
> >>> monitored app interacts with.
> >>>
> >>> The main attribute of a manageable component is a list of
> >>> "operationally significant events". Those are the events that DevOps
> >>> can do something about. For example, "failed to connect to cache
> >>> store" is significant, while "user input validation failed" is not.
> >>>
> >>> Events shall be as specific as possible so that DevOps do not spend
> >>> time on further analysis. For example, a "database failure" event is
> >>> not good. There should be "database connection failure", "invalid
> >>> database schema", "database authentication failure", etc. events.
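
As an illustration of the specificity argument above, here is a sketch with
made-up event IDs that split a generic "database failure" into operationally
distinct events. None of these IDs or names exist in Ignite; they are
invented for the example.

    // Illustration only: invented IDs showing "database failure" split into
    // operationally distinct events, so DevOps can act without further analysis.
    public enum StoreEvent {
        DB_CONNECTION_FAILURE(4100), // network/endpoint problem: check connectivity and that the DB is up
        DB_INVALID_SCHEMA(4200),     // deployment problem: re-run the schema migration
        DB_AUTH_FAILURE(4300);       // credentials problem: check the configured user/password

        private final int id;

        StoreEvent(int id) { this.id = id; }

        public int id() { return id; }
    }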
> >>>
> >>> An "event" is NOT the same as an exception occurring in the code.
> >>> Events identify a specific problem from the DevOps point of view. For
> >>> example, even if a "connection to cache store failed" exception might
> >>> be thrown from several places in the code, it is still the same
> >>> event. On the other side, even if SqlServerConnectionTimeout and
> >>> OracleConnectionTimeout exceptions might be caught in the same place,
> >>> those are different events, since MS SQL Server and Oracle are
> >>> usually handled by different DevOps groups in large enterprises!
> >>>
> >>> The operationally significant event IDs must be stable: they must not
> >>> change from one release to another. This is like a contract between
> >>> developers and DevOps.
> >>>
> >>> It should be the developers' responsibility to publish and maintain a
> >>> table with the following attributes:
> >>>
> >>> - Event ID
> >>> - Severity: Critical (Red) - the system is not operational; Warning
> >>> (Yellow) - the system is operational but its health is degraded; None
> >>> - just an informational message
> >>> - Description: concise but enough for DevOps to act without the
> >>> developers' help
> >>> - Recovery actions: what DevOps shall do to fix the issue without the
> >>> developers' help. DevOps might create automated recovery scripts
> >>> based on this information.
> >>>
> >>> For example:
> >>>
> >>> 10100 - Critical - Could not connect to Zookeeper to discover nodes -
> >>> 1) Open the Ignite configuration and find the Zookeeper connection
> >>> string. 2) Make sure Zookeeper is running.
> >>> 10200 - Warning - Ignite node left the cluster.
> >>>
> >>> Back to Ignite: it looks to me like we do not design for operations
> >>> as described above. We have no event IDs: our logging is subject to
> >>> change in new versions, so any patterns DevOps might use to detect
> >>> significant events would stop working after an upgrade.
> >>>
> >>> If I am not the only one who has such concerns, then we might open a
> >>> ticket to address this.
> >>>
> >>> Best regards, Alexey
> >>
> >
> > --
> > Alexey Kuznetsov
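
For illustration, the event catalog proposed in the thread (the 10100/10200
examples) could be kept as a single source of truth in code, from which the
published DevOps documentation is generated. Everything below is invented
for the sketch: the OperationalEvent type is not an existing Ignite API, and
the recovery text for 10200 is an assumption, since the original table left
it unspecified.

    // Hypothetical sketch of the proposed stable event catalog.
    public enum OperationalEvent {
        ZK_CONNECTION_FAILURE(
            10100,
            Severity.CRITICAL,
            "Could not connect to Zookeeper to discover nodes",
            "1) Open the Ignite configuration and find the Zookeeper connection string. "
                + "2) Make sure Zookeeper is running."),

        NODE_LEFT(
            10200,
            Severity.WARNING,
            "Ignite node left the cluster",
            // Recovery text invented for the example.
            "Check whether the node process is alive and restart it if needed.");

        public enum Severity { CRITICAL, WARNING, NONE }

        private final int id;
        private final Severity severity;
        private final String description;
        private final String recovery;

        OperationalEvent(int id, Severity severity, String description, String recovery) {
            this.id = id;
            this.severity = severity;
            this.description = description;
            this.recovery = recovery;
        }

        public int id() { return id; }
        public Severity severity() { return severity; }
        public String description() { return description; }
        public String recovery() { return recovery; }
    }

Generating the public event reference from a catalog like this would also
make the "stable ID" contract enforceable in code review.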