Re: [prometheus-users] jmx_exporter's MBean-level fetching well-intentioned but preventing required telemetry [feature request discussion]

Brian Brazil Fri, 05 Jun 2020 03:31:14 -0700

On Fri, 5 Jun 2020 at 11:13, Cameron Kerr wrote:


> Hi all, I've spent a few days coming up to speed on understanding
> jmx_exporter, and I think I'm in a pretty good place now, understanding how
> MBeans, JMX, and RMI work.
>
> I've so far deployed jmx_exporter in two ways:
>
> * as a Java Agent on Tomcat 6 (on RHEL6) for a small application that
> includes SOLR and PostgreSQL
> * and as a standalone Java HTTP server on Tomcat 8.5.34 (that comes
> bundled with a mission-critical application)
>
> I found going with the Java Agent relatively easy, although I think I'll
> contribute a blog post and pull-request to help on the documentation front.
>
> You might reasonably ask why I'm bothering with the HTTP server. Here's my
> business logic that drives this:
>
> * we have a mission-critical application that we urgently need to improve
> our visibility on to diagnose some performance limits we believe we're
> reaching
> * we're reluctant to introduce (and cause an outage to introduce) a java
> agent --- as far as I'm aware jmx_exporter lacks the ability that jconsole
> has to dynamically inject an agent.
>

It's a very simple Java agent that doesn't do anything fancy. I believe
it's possible if you already know how to do such things, though using it as
an agent is almost always better.


> * as part of a previous monitoring drive, we've already introduced the
> appropriate Remote JMX configuration (-D....jmxremote... etc.), which means
> we can introduce some monitoring into our production environment and easily
> restart the JMX exporter as needed to iterate through configuration changes.
>

The JMX exporter will pick up configuration changes without restarting.


>
> We recognise that running a separate JVM has its disadvantages, namely:
> * it will incur a JVM memory overhead
> * it will likely need to be run as the same user with the same
> version/type of JVM (I'm not sure if this is accurate, but it seems safer).
> * it creates a potential hole (via RMI) in the security boundary of the
> application, so we would prefer to house this on the same server (similar
> to a 'side-car' type of deployment, I suppose)
>

The really big disadvantage is that it's much slower. You also lose some
process and JVM metrics.


>
> So most of what I'm about to say is about Remote JMX mode of operation
> (but still potentially relevant in part to Agent mode).
>
> Here's the business value I need to obtain from jmx_exporter:
>
> 1) provide telemetry we're missing to diagnose urgent and important
> production issues, particularly for database connection pools and thread
> counts (memory/garbage collection would also be useful in the general case,
> and application-specific MBeans that would be useful in specific cases,
> such as applications that use SOLR or particular frameworks that instrument
> various URL handlers with nice statistics)
> 2) impart minimal changes to application runtime or risk changing
> behaviour in mission-critical production application
> 3) impart minimal changes in performance; we don't want to induce
> unreasonable load by introducing monitoring.
>
> As I understand it, the current implementation of jmx_exporter queries
> the Attributes of an MBean at the MBean level, effectively providing a
> 'batch' sort of API. This reduces the number of RMI round-trips, in the
> expectation that it is faster than what JConsole does by querying each
> individual Attribute (more round-trips, potentially over a remote
> connection). This does assume, though, that the time spent (and value
> received) querying all of the attributes is worthwhile. Let's see where
> this assumption, well-intentioned as it is, leads us in practice:
>
> I want to get telemetry around ThreadPool usage within Tomcat, so looking
> at JConsole, I see the following
>
> [image: 2020-06-05 17_38_53-RHEL Server 7 [Running] - Oracle VM
> VirtualBox.png]
>
> Great, connectionCount, currentThreadCount and currentThreadsBusy look to
> be things I would definitely be interested in; I'm unlikely to use most of
> the rest.
>
> Clicking on the 'http-nio-8082', I see the ObjectName being the following,
> which I put into my whitelistObjectNames
>
> Catalina:type=ThreadPool,name="http-nio-8082"
>
> So now my configuration looks something like the following:
>
> ---
> hostPort: 127.0.0.1:9090
> username:
> password:
> ssl: false
>
> lowercaseOutputLabelNames: true
> lowercaseOutputName: true
>
> # You really MUST use some whitelisting to select the bits of JMX you actually want.
> # You DO NOT want to query the entire MBean tree, which is what you get by default.
> # This will likely take about 10 seconds and may have unintended side-effects,
> # such as introducing lock contention or causing database queries to be run.
> #
> #
> whitelistObjectNames: [
>   'Catalina:type=ThreadPool,name="http-nio-8082"'
>   ]
>
> # It's not enough to simply grab the data; we need to do something with it to
> # turn it into metrics, otherwise that's potentially a lot of effort wasted
> # getting all that raw data (you did use a whitelist, right?)
> #
> rules:
>
> # Ah, due to a bug that was fixed in Tomcat 8.5.35 (our app bundles 8.5.34),
> # this results in a serialization error. Because the socketProperties is not
> # serialisable (it shows as 'Unavailable' in JConsole) it faults the entire
> # request for that object and returns an exception over the wire.
> #
> # https://bz.apache.org/bugzilla/show_bug.cgi?id=62871
> #
> - pattern: 'Catalina<type=ThreadPool, name="(\w+-\w+)-(\d+)"><>(currentThreadCount|currentThreadsBusy|connectionCount):'
>   name: tomcat_threadpool_$3
>   labels:
>     port: "$2"
>     protocol: "$1"
>   help: Tomcat threadpool $3
>   type: GAUGE
>
>
> (I've spoiled the story with the comment, but that's okay...)
>
> The problem (as other people have bumped into) is that Tomcat < 8.5.35,
> and other things will exhibit this behaviour also, is that ..... hang on,
> let me back up a bit to add some understanding to how this works:
>
> An MBean is essentially an object (okay, a subclass) that implements an
> Interface. Anything in Java can create MBeans; common examples being things
> like Tomcat, large libraries, and even the Java base environment itself.
> All these MBeans get registered into JMX (Java Management Extensions) which
> provide some structure and discoverability for tools like JConsole (or
> jmx_exporter). MBeans essentially expose various Attributes (methods that
> essentially 'getSomething'), Operations (other methods that might be used
> to change runtime state), and Notifications (which we completely ignore,
> along with Operations, for the purposes of jmx_exporter).
>
> JMX Exporter (in its HTTP server, external process form) connects (call it
> the 'client') to the (Tomcat) JVM ('server') over an RMI connection. This
> is effectively a form of IPC, where the client can invoke methods (RMI =
> Remote Method Invocation) on the server. So when you get the value of an
> Attribute, you are essentially calling some getSomething() method in an
> MBean. What you get from that is up to whatever implemented it (ie. you get
> a Plain-Old-Java-Object, or POJO for short). But to get from the 'server'
> over the RMI connection to the 'client' it needs to be serialised to be
> sent over the wire, deserialised at the other end, and then evaluated.
>
> Take socketProperties for example. I don't care about it; I care about
> currentThreadCount etc. But the problem with Tomcat (fixed in Tomcat
> 8.5.35, if you have the luxury of moving to that; our vendor-supplied
> application bundles Tomcat 8.5.34) is that its implementation of the
> 'getter' method for socketProperties returns something that is not
> serialisable (its class doesn't implement java.io.Serializable). This
> becomes a problem at the point where it needs to be serialised, which is
> when it is sent over RMI, and the result is an exception.
>
> Because jmx_exporter is using a method that says 'give me all the
> attributes for MBean B', that exception basically junks the whole result,
> and I lose the result of currentThreadCount etc. with it.
>
> JConsole, on the other hand, uses the slower-but-steadier 'tell me what
> attributes exist in MBean B' followed by a lot of 'give me Attribute A for
> MBean B', so it can handle that exception (showing it as a red
> 'Unavailable').
>

The JMX exporter used to go attribute by attribute; switching to batch gave
a substantial speedup. That's not a change I'd be looking to undo, as it'd
make the JMX exporter unusable for too many users for the sake of a small
handful of poor JMX implementations.
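
For illustration, here is a minimal sketch of the two fetch strategies discussed above, written against the standard javax.management API. The class AttributeFetch and its method names are hypothetical and are not taken from jmx_exporter. The demo uses the in-process platform MBean server, where a batch call simply omits attributes it cannot read; over Remote JMX the whole batch reply can fail to serialise, which is the case the fallback branch is meant to cover.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.management.Attribute;
import javax.management.JMException;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

// Hypothetical helper for illustration only; not part of jmx_exporter.
final class AttributeFetch {

    // Batch strategy: a single getAttributes() call for all requested names.
    static Map<String, Object> batch(MBeanServerConnection conn, ObjectName bean, String[] names)
            throws IOException, JMException {
        Map<String, Object> values = new LinkedHashMap<>();
        for (Attribute a : conn.getAttributes(bean, names).asList()) {
            values.put(a.getName(), a.getValue());
        }
        return values;
    }

    // JConsole-style strategy: one getAttribute() call per name.
    // An attribute that cannot be read is skipped instead of failing the lot.
    static Map<String, Object> oneByOne(MBeanServerConnection conn, ObjectName bean, String[] names) {
        Map<String, Object> values = new LinkedHashMap<>();
        for (String name : names) {
            try {
                values.put(name, conn.getAttribute(bean, name));
            } catch (IOException | JMException | RuntimeException e) {
                // This attribute is unavailable; carry on with the rest.
            }
        }
        return values;
    }

    // Try the cheap batch call first; fall back to per-attribute reads on failure.
    static Map<String, Object> batchWithFallback(MBeanServerConnection conn, ObjectName bean,
                                                 String[] names) {
        try {
            return batch(conn, bean, names);
        } catch (IOException | JMException e) {
            return oneByOne(conn, bean, names);
        }
    }

    public static void main(String[] args) throws Exception {
        MBeanServerConnection conn = ManagementFactory.getPlatformMBeanServer();
        ObjectName threading = new ObjectName("java.lang:type=Threading");
        String[] names = {"ThreadCount", "PeakThreadCount", "NoSuchAttribute"};
        System.out.println(batchWithFallback(conn, threading, names));
    }
}
```

The fallback costs one extra round-trip per attribute, but only for MBeans whose batch call has already failed, so well-behaved MBeans keep the batch speedup.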


>
> Now let's look at another similar case; one where there are no bugs
> present. In this example I want to get information about database
> connection pool utilisation because this is valuable information and a
> common load-related performance issue (this tends to be true of
> connection-pools in general, such as for LDAP, but you get plenty of
> third-party libraries in the JDBC space).
>
> For this you'll need to find some suitable MBeans, assuming they are
> even visible at all; one of my studies had a Tomcat 6 deployment with
> PostgreSQL and it didn't seem to expose any MBeans that I could see, while
> my other study had Tomcat 7 and the MBeans lived in a domain specific to
> the application (in this case, an online learning product called
> Blackboard).
>
> [image: 2020-06-05 20_40_34-RHEL Server 7 [Running] - Oracle VM
> VirtualBox.png]
>
> Note that the ClassName is org.apache.tomcat.jdbc.pool.jmx.ConnectionPool
> .... but it's the application that decides where to put the MBean and what
> to use as the ObjectName, so if the application is managing its own
> connection pools (rather than using a connection-pool provided by the
> middleware), be prepared to hunt around for it. The ClassName does come
> into play though, because that tells us what data is inside the MBean (and
> helps us find some documentation as to what those attributes might
> actually mean).
>
> So let's see what attributes this fairly common class exposes for
> monitoring: There are some obvious things here we would want to measure,
> either as GAUGES or as COUNTERS, but most of it we wouldn't need or want.
> In this screenshot, remember that I'm using Remote JMX, and JConsole is
> also using Remote JMX in this instance. If you hover over the red
> Unavailable for JdbcInterceptorsAsArray, you see the exception that causes
> it to be unavailable, and it's the same exception you see in the
> jmx_exporter (or more accurately, ./jmx_prometheus_httpserver.jar) when you
> have debug logging enabled.
>
> [image: 2020-06-05 21_29_05-RHEL Server 7 [Running] - Oracle VM
> VirtualBox.png]
>
> java.rmi.UnmarshalException: error unmarshalling return; nested exception is:
> java.lang.ClassNotFoundException: org.apache.tomcat.jdbc.pool.PoolProperties$InterceptorDefinition (no security manager: RMI class loader disabled)
>
> Let's unpack this a bit to understand what this means: the RMI client
> (JConsole in Remote JMX mode, or jmx_prometheus_httpserver.jar) has
> received a serialised instance of a class called
> org.apache.tomcat.jdbc.pool.PoolProperties$InterceptorDefinition, and it
> needs to deserialise it to extract a value from it (eg. a string value or
> floating point value). But to do that it needs to have that class available
> somewhere. You can see that this class is specific to Tomcat and JDBC, so
> JConsole (or jmx_prometheus_httpserver.jar) is unlikely to have it
> available.
>
> Presumably you could hunt around (a lot) and stuff a lot of things into
> the classpath of the RMI client, but that's painful, needless work (I tried
> and failed, but I'm not enough of a Tomcat wizard to know how to determine
> what classpath is present (classloaders, yay) for that webapp etc.).
>
> Alternatively, I could apparently make use of 'RMI class loader' which
> sends the classes over the wire too to be loaded on the client side --- and
> also have to navigate a security manager --- that's a learning path I may
> have to attempt next.
>
> Either way, considering I have no interest in JdbcInterceptorsAsArray
> anyway, all I want is Active, Idle, Size and a few counters that bear
> critical importance for my monitoring. But if I can't get a complete result
> set, I get nothing.
>
>
> Let's recap and see how this affects the value I'm expecting to achieve:
>
> 1) provide telemetry we're missing to diagnose urgent and important
> production issues, particularly for database connection pools and thread
> counts (memory/garbage collection would also be useful in the general case,
> and application-specific MBeans that would be useful in specific cases,
> such as applications that use SOLR or particular frameworks that instrument
> various URL handlers with nice statistics)
> 2) impart minimal changes to application runtime or risk changing
> behaviour in mission-critical production application
> 3) impart minimal changes in performance; we don't want to induce
> unreasonable load by introducing monitoring.
>
> #1 is mostly unattainable, either because something is not serialisable on
> the RMI server side, or cannot be deserialised on the RMI client side. All
> I can get are the 'nice-to-haves'.
> #2 would be met by Remote JMX; if I have to use the Agent then my
> lead-time for introducing monitoring increases, decreasing my agility and
> ability to quickly withdraw the functionality in a production environment
> without an application restart.
> #3 with appropriate whitelisting of ObjectNames we can get most of the way
> there and could reasonably scrape the metrics once a minute without fear,
> although some MBeans do become very large, particularly if they contain
> arrays, when you often only need a small handful of attributes. If we can
> scrape a smaller set however, we could achieve a higher fidelity if
> desired, which might paint a truer picture if all you have to work with are
> gauges.
>
>
> I would like to propose that we introduce one of two things:
>
> EITHER add a new option, whitelistObjectNameAttributes, that could be
> used for JConsole-style attribute-at-a-time fetching (or similar; can you
> grab a few named attributes in one go?), which would allow for either the
> broad-brush or fine-brush approach to collecting the data;
>

I'd be open to considering an attribute name blacklist, if there's a sane
way to specify that.
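
As a rough sketch of what an attribute name blacklist could look like at the JMX level, the example below keeps the single batch call but leaves the excluded names out of the request. The class ExcludingFetch and its excluded parameter are hypothetical and do not correspond to any existing jmx_exporter option. The demo runs against the in-process platform MBean server purely to show the mechanics.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.management.Attribute;
import javax.management.JMException;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

// Hypothetical helper for illustration only; not part of jmx_exporter.
final class ExcludingFetch {

    // Ask the MBean which attributes it has, drop the excluded names,
    // then fetch everything that is left in one getAttributes() call.
    static Map<String, Object> fetch(MBeanServerConnection conn, ObjectName bean,
                                     Set<String> excluded) throws IOException, JMException {
        List<String> wanted = new ArrayList<>();
        for (MBeanAttributeInfo info : conn.getMBeanInfo(bean).getAttributes()) {
            if (info.isReadable() && !excluded.contains(info.getName())) {
                wanted.add(info.getName());
            }
        }
        Map<String, Object> values = new LinkedHashMap<>();
        for (Attribute a : conn.getAttributes(bean, wanted.toArray(new String[0])).asList()) {
            values.put(a.getName(), a.getValue());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        MBeanServerConnection conn = ManagementFactory.getPlatformMBeanServer();
        ObjectName runtime = new ObjectName("java.lang:type=Runtime");
        // "SystemProperties" stands in for a large or unserialisable attribute.
        Set<String> excluded = Collections.singleton("SystemProperties");
        System.out.println(fetch(conn, runtime, excluded).keySet());
    }
}
```

Because the excluded getter is never invoked on the server, its value is never serialised, so an attribute like socketProperties or JdbcInterceptorsAsArray could no longer take the rest of the MBean's result down with it.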


>
> OR allow for using the slower attribute-at-a-time as either an option or
> as a fallback.
>

I'd rather not; the performance difference is just too much.


On a higher level, I'd suggest looking at ways to get metrics that don't
involve JMX. Using client_java directly as far as possible will avoid all
the fun and performance issues that JMX brings.

Brian


>
> Personally I would prefer the first option because I would much rather
> pick and choose, since I need to be familiar with what data is available
> anyway in order to use it effectively.
>
> I'm not a Java programmer (at all, but I am a bit of a polyglot and I've
> been supporting Java workloads for years) but I'd be willing to give a go
> at implementing this and submitting a pull-request if people would be
> interested in receiving one.
>
>
> PS. If anyone would like an Ansible playbook for deploying
> jmx_prometheus_httpserver.jar I'm willing to share what I have so far.
>
> PPS. If anyone has experience setting up RMI class loader, I'd love some
> tips.
>
> Thanks for reading this far, and I hope this (long) post helps people to
> understand and use jmx_exporter more effectively. Once I complete some of
> this, you can expect some documentation-related PRs.
>
> Cameron
>
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-users/04481e73-465e-4815-a6a9-4697c4e930ceo%40googlegroups.com
> <https://groups.google.com/d/msgid/prometheus-users/04481e73-465e-4815-a6a9-4697c4e930ceo%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>


-- 
Brian Brazil
www.robustperception.io

