Denis, IGNITE-5620 is a completely different thing. Let's not mix cluster monitoring and parser errors.

Wed, 16 Aug 2017 at 2:57, Denis Magda <dma...@apache.org>:

> Alexey,
>
> I didn't know that such an improvement as consistent IDs for errors and
> events could be used as an integration point with DevOps tools. Thanks
> for sharing your experience with us.
>
> Would you step in as an architect for this task and file a JIRA ticket
> with all the required information?
>
> In general, we've already planned to do something around this, starting
> with SQL: https://issues.apache.org/jira/browse/IGNITE-5620
>
> It makes sense to consider your input before the work on IGNITE-5620 is
> started.
>
> —
> Denis
>
> > On Aug 15, 2017, at 10:56 AM, Alexey Kukushkin
> > <alexeykukush...@yahoo.com.INVALID> wrote:
> >
> > Hi Alexey,
> >
> > A nice thing about delegating alerting to 3rd-party enterprise systems
> > is that those systems already deal with lots of things, including
> > distributed apps.
> >
> > What is needed from Ignite is to consistently write to log files
> > (again, that means stable event IDs, proper event granularity, no
> > repetition, documentation). It would then be the 3rd-party monitoring
> > system's responsibility to monitor log files on all nodes and to
> > filter, aggregate, process, visualize and notify on events.
> >
> > How would a monitoring tool deal with an event like "node left"?
> >
> > The only thing needed from Ignite is to write an entry like the one
> > below to the log files on all Ignite servers. In this example, 3300
> > identifies the "node left" event and will never change in the future,
> > even if the text description changes:
> >
> > [2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH left the cluster
> >
> > Then we document somewhere on the web that Ignite has event 3300 and
> > that it means a node left the cluster, and maybe provide documentation
> > on how to deal with it. Some examples:
> >
> > Oracle Web Cache events:
> > https://docs.oracle.com/cd/B14099_19/caching.1012/b14046/event.htm#sthref2393
> > MS SQL Server events:
> > https://msdn.microsoft.com/en-us/library/cc645603(v=sql.105).aspx
> >
> > That is all for Ignite! Everything else is handled by the specific
> > monitoring system configured by DevOps on the customer side.
> >
> > Based on Ignite documentation similar to the above, the DevOps of a
> > company where Ignite is going to be used will configure their
> > monitoring system to understand Ignite events. Consider the "node
> > left" event as an example.
> >
> > - This event is output on every node, but DevOps do not want to be
> > notified many times. To address this, they will build an "Ignite
> > model" with a parent-child dependency between the components "Ignite
> > Cluster" and "Ignite Node". For example, this is how you do it in
> > Nagios:
> > https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/4/en/dependencies.html
> > and this is how you do it in Microsoft SCSM:
> > https://docs.microsoft.com/en-us/system-center/scsm/auth-classes.
> > Then DevOps will configure "node left" monitors in SCSM (or "checks"
> > in Nagios) for the parent "Ignite Cluster" and child "Ignite Node"
> > components. The state change (OK -> WARNING) and notification (email,
> > SMS, whatever) will be configured only for the "Ignite Cluster"'s
> > "node left" monitor.
> > - Now suppose a node left. The "node left" monitor (which uses a log
> > file monitoring plugin) on "Ignite Node" will detect the event and
> > pass it to the parent. This will trigger an "Ignite Cluster" state
> > change from OK to WARNING and send a notification. No more
> > notifications will be sent unless the "Ignite Cluster" state is reset
> > back to OK, which happens either manually, on timeout, or
> > automatically on "node joined".
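
For illustration, here is a minimal Java sketch of the contract described
above: the producing side writes an entry with a stable numeric ID, and the
monitoring side keys its pattern on that ID rather than on the wording.
Everything here is hypothetical - EVT_NODE_LEFT, the message layout and the
use of SLF4J are assumptions made for the sketch, not an existing Ignite API.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class NodeLeftEventExample {
        /** Stable, documented event ID: must never change between releases. */
        private static final int EVT_NODE_LEFT = 3300;

        private static final Logger log =
            LoggerFactory.getLogger(NodeLeftEventExample.class);

        /**
         * Ignite side: writes an entry such as
         * [2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH left the cluster
         * (the timestamp and level are added by the logger configuration).
         */
        void onNodeLeft(String nodeId) {
            log.warn("{} Node {} left the cluster", EVT_NODE_LEFT, nodeId);
        }

        /** Monitoring side: the stable ID, not the wording, is what a
         *  log-watching plugin would match on. */
        static final Pattern NODE_LEFT =
            Pattern.compile("\\[WARN]\\s+3300\\s+Node\\s+(\\S+)\\s+left");

        static boolean isNodeLeft(String logLine) {
            Matcher m = NODE_LEFT.matcher(logLine);
            return m.find(); // m.group(1) would yield the node ID
        }
    }

The description text may be reworded in a future release; the numeric ID may
not. That asymmetry is the whole contract.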
> >
> > This was just FYI. We, Ignite developers, do not care about how
> > monitoring works; that is the responsibility of the customer's DevOps.
> > Our responsibility is consistent event logging.
> >
> > Thank you!
> >
> > Best regards, Alexey
> >
> > On Tuesday, August 15, 2017, 6:16:25 PM GMT+3, Alexey Kuznetsov
> > <akuznet...@apache.org> wrote:
> >
> > Alexey,
> >
> > How are you going to deal with the distributed nature of an Ignite
> > cluster? And how do you propose to handle node restarts/stops?
> >
> > On Tue, Aug 15, 2017 at 9:12 PM, Alexey Kukushkin
> > <alexeykukush...@yahoo.com.invalid> wrote:
> >
> >> Hi Denis,
> >>
> >> Monitoring tools simply watch event logs for patterns (regexes in the
> >> case of unstructured logs like text files). A stable event ID (one
> >> that does not change in new releases) identifying a specific issue
> >> would be such a pattern. We need to introduce such event IDs according
> >> to the principles I described in my previous mail.
> >>
> >> Best regards, Alexey
> >>
> >> On Tuesday, August 15, 2017, 4:53:05 AM GMT+3, Denis Magda
> >> <dma...@apache.org> wrote:
> >>
> >> Hello Alexey,
> >>
> >> Thanks for the detailed input.
> >>
> >> Assuming that Ignite supported the suggested event-based model, how
> >> could it be integrated with the tools you mentioned, like DynaTrace or
> >> Nagios? Is this all we need?
> >>
> >> —
> >> Denis
> >>
> >>> On Aug 14, 2017, at 5:02 AM, Alexey Kukushkin
> >>> <alexeykukush...@yahoo.com.INVALID> wrote:
> >>>
> >>> Igniters,
> >>>
> >>> While preparing some Ignite materials for administrators I found that
> >>> Ignite is not friendly to such a critical DevOps practice as
> >>> monitoring.
> >>>
> >>> TL;DR: I think Ignite lacks structured descriptions of abnormal
> >>> events, with references to event IDs in the logs that do not change
> >>> as new versions are released.
> >>>
> >>> MORE DETAILS
> >>>
> >>> I call an application "monitoring friendly" if it allows DevOps to:
> >>> 1. immediately receive a notification (email, SMS, etc.)
> >>> 2. understand what the problem is without involving developers
> >>> 3. provide an automated recovery action.
> >>>
> >>> Large enterprises do not implement custom solutions. They usually use
> >>> tools like DynaTrace, Nagios, SCOM, etc. to monitor all apps in the
> >>> enterprise consistently. All such tools have a similar architecture:
> >>> a dashboard showing apps as "green/yellow/red", plus numerous
> >>> "connectors" to look for events in text logs, ESBs, database tables,
> >>> etc.
> >>>
> >>> For each app, DevOps build a "health model": a diagram displaying the
> >>> app's "manageable" components and the app boundaries. A "manageable"
> >>> component is something that can be started/stopped/configured in
> >>> isolation. The "system boundary" is a list of external apps that the
> >>> monitored app interacts with.
> >>>
> >>> The main attribute of a manageable component is a list of
> >>> "operationally significant events". Those are the events that DevOps
> >>> can do something about. For example, "failed to connect to cache
> >>> store" is significant, while "user input validation failed" is not.
> >>>
> >>> Events shall be as specific as possible so that DevOps do not spend
> >>> time on further analysis. For example, a "database failure" event is
> >>> not good. There should be "database connection failure", "invalid
> >>> database schema", "database authentication failure", etc. events.
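
As an illustration of the specificity argument above, here is a sketch with
made-up event IDs that split a generic "database failure" into operationally
distinct events. None of these IDs or names exist in Ignite; they are
invented for the example.

    // Illustration only: invented IDs showing "database failure" split into
    // operationally distinct events, so DevOps can act without further analysis.
    public enum StoreEvent {
        DB_CONNECTION_FAILURE(4100), // network/endpoint problem: check connectivity and that the DB is up
        DB_INVALID_SCHEMA(4200),     // deployment problem: re-run the schema migration
        DB_AUTH_FAILURE(4300);       // credentials problem: check the configured user/password

        private final int id;

        StoreEvent(int id) { this.id = id; }

        public int id() { return id; }
    }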
> >>>
> >>> An "event" is NOT the same as an exception occurring in the code.
> >>> Events identify a specific problem from the DevOps point of view. For
> >>> example, even if a "connection to cache store failed" exception might
> >>> be thrown from several places in the code, it is still the same
> >>> event. On the other side, even if SqlServerConnectionTimeout and
> >>> OracleConnectionTimeout exceptions might be caught in the same place,
> >>> those are different events, since MS SQL Server and Oracle are
> >>> usually handled by different DevOps groups in large enterprises!
> >>>
> >>> The operationally significant event IDs must be stable: they must not
> >>> change from one release to another. This is like a contract between
> >>> developers and DevOps.
> >>>
> >>> It should be the developers' responsibility to publish and maintain a
> >>> table with the following attributes:
> >>>
> >>> - Event ID
> >>> - Severity: Critical (Red) - the system is not operational; Warning
> >>> (Yellow) - the system is operational but its health is degraded; None
> >>> - just an informational message
> >>> - Description: concise but enough for DevOps to act without the
> >>> developers' help
> >>> - Recovery actions: what DevOps shall do to fix the issue without the
> >>> developers' help. DevOps might create automated recovery scripts
> >>> based on this information.
> >>>
> >>> For example:
> >>>
> >>> 10100 - Critical - Could not connect to Zookeeper to discover nodes -
> >>> 1) Open the Ignite configuration and find the Zookeeper connection
> >>> string. 2) Make sure Zookeeper is running.
> >>> 10200 - Warning - Ignite node left the cluster.
> >>>
> >>> Back to Ignite: it looks to me like we do not design for operations
> >>> as described above. We have no event IDs: our logging is subject to
> >>> change in new versions, so any patterns DevOps might use to detect
> >>> significant events would stop working after an upgrade.
> >>>
> >>> If I am not the only one who has such concerns, then we might open a
> >>> ticket to address this.
> >>>
> >>> Best regards, Alexey
> >>
> >
> > --
> > Alexey Kuznetsov
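
For illustration, the event catalog proposed in the thread (the 10100/10200
examples) could be kept as a single source of truth in code, from which the
published DevOps documentation is generated. Everything below is invented
for the sketch: the OperationalEvent type is not an existing Ignite API, and
the recovery text for 10200 is an assumption, since the original table left
it unspecified.

    // Hypothetical sketch of the proposed stable event catalog.
    public enum OperationalEvent {
        ZK_CONNECTION_FAILURE(
            10100,
            Severity.CRITICAL,
            "Could not connect to Zookeeper to discover nodes",
            "1) Open the Ignite configuration and find the Zookeeper connection string. "
                + "2) Make sure Zookeeper is running."),

        NODE_LEFT(
            10200,
            Severity.WARNING,
            "Ignite node left the cluster",
            // Recovery text invented for the example.
            "Check whether the node process is alive and restart it if needed.");

        public enum Severity { CRITICAL, WARNING, NONE }

        private final int id;
        private final Severity severity;
        private final String description;
        private final String recovery;

        OperationalEvent(int id, Severity severity, String description, String recovery) {
            this.id = id;
            this.severity = severity;
            this.description = description;
            this.recovery = recovery;
        }

        public int id() { return id; }
        public Severity severity() { return severity; }
        public String description() { return description; }
        public String recovery() { return recovery; }
    }

Generating the public event reference from a catalog like this would also
make the "stable ID" contract enforceable in code review.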