On Mon, Jun 25, 2012 at 11:16 AM, K.C. Ramakrishna <[email protected]> wrote:
>
> Hi all,
>
> We are trying to look into monitoring servers in Beta and Prod environments.
>
> The stack:
> a. 2(more in future) Front End apache httpd/tomcat/Liferay Servers
> b. Independent CAS-SSO and SOLR servers.
> c. standalone server running webservices (written in Java)
> d. 2MySQL (Percona) servers 1 WRITE server and both READ.
> e. Front end Hardware Load Balancers.
>
> We want to monitor continuously:
> a. All the Linux boxes and the services running on them,
> b. The performance (and history) of the Java applications too.
>
> We are exploring the best ways to monitor the whole setup including alerts
> and automated restarts etc. We are primarily from a development background
> with only basic admin experience (basic bash, installations, tuning etc).
>
> I have researched all the usual suspects: Cacti, Nagios, ZenOSS, Zabix?..
> JMX seems to be very popular for tomcat/Java.
>
> What are your recommendations for handling this scenario? What are the pros
> and cons of various tools and approaches?
> Please do share your thoughts on what will be a good solution. All live
> examples will be very welcome in educating us.

All the tools you've mentioned above are "systems level" monitoring
applications. They are useful, easy to set up and necessary, but they can
only tell you so much about your applications, because they measure what is
outside them (the JVM's / the OS's internals). The problem is that the
application we care about gets treated like a black box. That might be fine
if we had a well-behaved / well-understood black box. But usually we don't -
and even if we do, it changes.

I've found that the following general pattern of practical "Dev Ops"
problems occurs frequently at work:

Scenario #1:
* Ops says "From <insert system monitoring tool>, we see that network I/O
  has increased since the last launch. We believe that performance can
  improve if our application reduced excessive fetches and instead chose to
  cache or preload."
* Dev says "Well, you guys seem to be running <insert operational
  monitoring / management tool> on *your* machines for management. It is not
  the problem of our application. We are not at fault! We have done nothing
  that increases reads. We deny it all!"
(replace network I/O with anything that cannot be pinpointed)

Scenario #2:
* Ops says "From <insert system monitoring tool>, we see that there was a
  CPU spike. Any idea what happened? Here are the application's logs from
  the time."
* Dev says "We don't know which function/component/part caused it. We will
  try to reproduce it in the lab."
(and usually, no lab can be as hairy as reality)

At the last startup I worked at (which is no longer a startup and was
serving close to half a million requests per second as of 6 months ago), we
zeroed in on mondemand to do white box metrics measurement to tackle the
frequently recurring scenarios above. See http://mondemand.org/ . We chose
LWES as the transport, but that was based on our setup. YMMV.

mondemand requires instrumenting one's code.
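
To make "instrumenting" concrete, here is a minimal sketch in Python of what
white box counters and timers can look like inside an application. It is not
the mondemand API: the Metrics class, the key names and the fetch_profile
function are all made up for illustration, and a real setup would hand the
snapshot to a transport (LWES in our case) instead of just returning a dict.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Hypothetical in-process metrics registry (NOT the mondemand API)."""

    def __init__(self, program):
        self.program = program
        self.counters = defaultdict(int)
        self.timings = defaultdict(float)

    def increment(self, key, by=1):
        """Count an application-level event, e.g. a cache miss."""
        self.counters[key] += by

    @contextmanager
    def timed(self, key):
        """Accumulate wall-clock time and call count for a block of code."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[key] += time.perf_counter() - start
            self.counters[key + ".calls"] += 1

    def snapshot(self):
        """Return current values as a flat dict, ready to give to a transport."""
        data = {"program": self.program}
        data.update(self.counters)
        data.update({k + ".seconds": v for k, v in self.timings.items()})
        return data


# Assumed program name; one registry shared by the whole application.
metrics = Metrics("webservice")


def fetch_profile(user_id, cache):
    """Example application function with its fetches made visible."""
    with metrics.timed("profile.fetch"):
        if user_id in cache:
            metrics.increment("profile.cache_hit")
            return cache[user_id]
        metrics.increment("profile.cache_miss")
        cache[user_id] = {"id": user_id}  # stand-in for a real remote fetch
        return cache[user_id]
```

With counters like these, the argument in Scenario #1 stops being a matter of
opinion: if profile.cache_miss jumped after the last launch, the extra network
I/O came from the app; if it did not, it came from somewhere else.
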
mondemand also requires a one-time investment of effort in collectively
brainstorming about which metrics we want to measure and how we will measure
them, keeping the app and the business in focus. But once done, the payoff
is self-evident, because the black box turns into a 'white box'.

As for the system tools mentioned above (Zabbix, Nagios, etc.), it is my
opinion that their incremental advantage over each other may be negligible,
since almost all of them provide extensibility (plugins / extensions).

The distinct advantages of mondemand are:

1. When the app is Java and the system is Unix, there is a large gap between
   "dev" and "ops". Unix'y things like signals, controlling I/O streams
   (logging), controlling priority or even measuring memory used, etc., are
   difficult to achieve from Java. Yes, JMX can help, somewhat, but again
   JMX is a means of measuring/controlling the "Java system" (not the app).
2. As mentioned above, white box measurement is the biggest gain.
3. The app can react to outside events - not only within the system but also
   networked events (ex: if you restarted the DB, the DB restart "script"
   could emit an event and the app can "react" by reestablishing the
   connection).

Yes, there is a small performance penalty for all of this - but, IMHO, it is
worth it. :)

cheers,

  -Suraj

--
Career Gear - Industry Driven Talent Factory
http://careergear.in/
_______________________________________________
ILUGC Mailing List:
http://www.ae.iitm.ac.in/mailman/listinfo/ilugc
