[prometheus-users] jmx_exporter's MBean-level fetching well-intentioned but preventing required telemetry [feature request discussion]

Cameron Kerr Fri, 05 Jun 2020 03:13:27 -0700

Hi all, I've spent a few days coming up to speed on understanding 
jmx_exporter, and I think I'm in a pretty good place now, understanding how 
MBeans, JMX, and RMI work.


I've so far deployed jmx_exporter in two ways:

* as a Java Agent on Tomcat 6 (on RHEL6) for a small application that 
includes SOLR and PostgreSQL
* and as a standalone Java HTTP server on Tomcat 8.5.34 (that comes bundled 
with a mission-critical application)

I found going with the Java Agent relatively easy, although I think I'll 
contribute a blog post and pull-request to help on the documentation front.

You might reasonably ask why I'm bothering with the HTTP server. Here's my 
business logic that drives this:

* we have a mission-critical application that we urgently need to improve 
our visibility on to diagnose some performance limits we believe we're 
reaching
* we're reluctant to introduce (and cause an outage to introduce) a Java 
agent --- as far as I'm aware jmx_exporter lacks the ability that JConsole 
has to dynamically inject an agent.
* as part of a previous monitoring drive, we've already introduced the 
appropriate Remote JMX configuration (-D....jmxremote... etc.), which means 
we can introduce some monitoring into our production environment and easily 
restart the JMX exporter as needed to iterate through configuration changes.

We recognise that running a separate JVM has its disadvantages, namely:
* it will incur a JVM memory overhead
* it will likely need to be run as the same user with the same version/type 
of JVM (I'm not sure if this is accurate, but it seems safer).
* it creates a potential hole (via RMI) in the security boundary of the 
application, so we would prefer to house this on the same server (similar 
to a 'side-car' type of deployment, I suppose)

So most of what I'm about to say is about the Remote JMX mode of operation 
(but still potentially relevant in part to Agent mode).

Here's the business value I need to obtain from jmx_exporter:

1) provide telemetry we're missing to diagnose urgent and important 
production issues, particularly for database connection pools and thread 
counts (memory/garbage collection would also be useful in the general case, 
and application-specific MBeans that would be useful in specific cases, 
such as applications that use SOLR or particular frameworks that instrument 
various URL handlers with nice statistics)
2) impart minimal changes to application runtime or risk changing behaviour 
in mission-critical production application
3) impart minimal changes in performance; we don't want to induce 
unreasonable load by introducing monitoring.

As I understand it, the current implementation of jmx_exporter queries at 
the MBean level, fetching all of the Attributes available within an MBean 
in one request. This effectively provides a 'batch' sort of API, which 
reduces the number of RMI round-trips in the expectation that this is 
faster than what JConsole does by querying each individual Attribute (more 
round-trips, potentially over a remote connection). This does assume, 
though, that the time spent on (and value received from) querying all of 
the attributes is worthwhile. Let's see where this assumption, 
well-intentioned as it is, leads us in practice:

I want to get telemetry around ThreadPool usage within Tomcat, so looking 
at JConsole, I see the following

[image: 2020-06-05 17_38_53-RHEL Server 7 [Running] - Oracle VM 
VirtualBox.png]

Great, connectionCount, currentThreadCount and currentThreadsBusy look to 
be things I would definitely be interested in; I'm unlikely to use most of 
the rest.

Clicking on the 'http-nio-8082', I see the ObjectName being the following, 
which I put into my whitelistObjectNames

Catalina:type=ThreadPool,name="http-nio-8082"

So now my configuration looks something like the following:

---
hostPort: 127.0.0.1:9090
username:
password:
ssl: false

lowercaseOutputLabelNames: true
lowercaseOutputName: true

# You really MUST use some whitelisting to select the bits of JMX you
# actually want.
# You DO NOT want to query the entire MBean tree, which is what you get by
# default. That will likely take about 10 seconds (depending on the
# application) and may have unintended side-effects, such as introducing
# lock contention or causing database queries to be run.
#
whitelistObjectNames: [
  'Catalina:type=ThreadPool,name="http-nio-8082"'
  ]

# It's not enough to simply grab the data; we need to do something with it
# to turn it into metrics, otherwise that's potentially a lot of effort
# wasted getting all that raw data (you did use a whitelist, right?)
#
rules:

# Ah, due to a bug that was fixed in Tomcat 8.5.35 (our app bundles 8.5.34),
# this results in a serialization error. Because socketProperties is not
# serialisable (it shows as 'Unavailable' in JConsole), it faults the entire
# request for that object and returns an exception over the wire.
#
# https://bz.apache.org/bugzilla/show_bug.cgi?id=62871
#
- pattern: 'Catalina<type=ThreadPool, name="(\w+-\w+)-(\d+)"><>(currentThreadCount|currentThreadsBusy|connectionCount):'
  name: tomcat_threadpool_$3
  labels:
    port: "$2"
    protocol: "$1"
  help: Tomcat threadpool $3
  type: GAUGE


(I've spoiled the story with the comment, but that's okay...)
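
To make the rule above concrete, here is a small sketch of how its pattern 
could be applied to one flattened attribute line. It assumes the 
'domain<key=value, ...><>attrName: value' input shape that the jmx_exporter 
README describes for patterns; the class RuleSketch and its methods are 
hypothetical names for illustration only, and this is not the exporter's 
own matching code.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RuleSketch {
    // The pattern from the rule above, written as a Java string literal.
    static final Pattern RULE = Pattern.compile(
        "Catalina<type=ThreadPool, name=\"(\\w+-\\w+)-(\\d+)\"><>"
        + "(currentThreadCount|currentThreadsBusy|connectionCount):");

    /** Metric name for a flattened attribute line, or null if the rule does not match. */
    static String metricName(String line) {
        Matcher m = RULE.matcher(line);
        if (!m.find()) {
            return null;
        }
        // name: tomcat_threadpool_$3, followed by lowercaseOutputName: true
        return ("tomcat_threadpool_" + m.group(3)).toLowerCase(Locale.ROOT);
    }

    /** Labels for a flattened attribute line; empty if the rule does not match. */
    static Map<String, String> labels(String line) {
        Map<String, String> labels = new LinkedHashMap<>();
        Matcher m = RULE.matcher(line);
        if (m.find()) {
            labels.put("port", m.group(2));
            labels.put("protocol", m.group(1));
        }
        return labels;
    }

    public static void main(String[] args) {
        // Assumed input shape; the value after the colon is made up.
        String line = "Catalina<type=ThreadPool, name=\"http-nio-8082\"><>currentThreadCount: 10";
        // prints: tomcat_threadpool_currentthreadcount {port=8082, protocol=http-nio}
        System.out.println(metricName(line) + " " + labels(line));
    }
}
```

An attribute such as socketProperties does not match the alternation in the 
third group, so the rule would simply never produce a metric for it.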

The problem (as other people have bumped into) is that Tomcat < 8.5.35, and 
other things will exhibit this behaviour also, is that ..... hang on, let 
me back up a bit to add some understanding to how this works:

An MBean is essentially an object (okay, an instance of a class) that 
implements a management Interface. Anything in Java can create MBeans; 
common examples being things like Tomcat, large libraries, and even the 
Java base environment itself. All these MBeans get registered into JMX 
(Java Management Extensions), which provides some structure and 
discoverability for tools like JConsole (or jmx_exporter). MBeans 
essentially expose various Attributes (methods that essentially 
'getSomething'), Operations (other methods that might be used to change 
runtime state), and Notifications (which we completely ignore, along with 
Operations, for the purposes of jmx_exporter).
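
For anyone who hasn't seen one, here is a minimal sketch of a 'standard' 
MBean registered with the JVM's platform MBean server. The names 
MBeanSketch, DemoPool, DemoPoolMBean and the example.sketch domain are all 
made up for illustration; Tomcat's real MBeans are built differently, but 
the idea of getters surfacing as Attributes is the same.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

class MBeanSketch {
    /** Management interface: each getter becomes a readable Attribute. */
    public interface DemoPoolMBean {
        int getActive();
        int getIdle();
        void purge(); // not a getter, so this is exposed as an Operation
    }

    /** Implementation; by convention its name is the interface name minus "MBean". */
    public static class DemoPool implements DemoPoolMBean {
        public int getActive() { return 3; }
        public int getIdle() { return 7; }
        public void purge() { }
    }

    static final String NAME = "example.sketch:type=DemoPool,name=mbean-sketch";

    /** Registers the demo MBean (once) and returns the Attribute names JMX discovered. */
    static List<String> attributeNames() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName(NAME);
        if (!server.isRegistered(name)) {
            server.registerMBean(new DemoPool(), name);
        }
        List<String> names = new ArrayList<>();
        for (MBeanAttributeInfo info : server.getMBeanInfo(name).getAttributes()) {
            names.add(info.getName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Attributes: " + attributeNames());
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Reading an Attribute ends up calling the matching getter.
        System.out.println("Active = " + server.getAttribute(new ObjectName(NAME), "Active"));
    }
}
```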

JMX Exporter (in its HTTP server, external process form) connects (call it 
the 'client') to the (Tomcat) JVM ('server') over an RMI connection. This 
is effectively a form of IPC, where the client can invoke methods (RMI = 
Remote Method Invocation) on the server. So when you get the value of an 
Attribute, you are essentially calling some getSomething() method in an 
MBean. What you get from that is up to whatever implemented it (ie. you get 
a Plain-Old-Java-Object, or POJO for short). But to get from the 'server' 
over the RMI connection to the 'client' it needs to be serialised to be 
sent over the wire, deserialised at the other end, and then evaluated.
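
As a rough sketch of what that client side looks like in code, the 
following connects to a Remote JMX endpoint using the standard 
javax.management.remote API. The class RemoteJmxSketch and its methods are 
hypothetical; the URL form is, as far as I can tell, what a hostPort 
setting such as 127.0.0.1:9090 expands to, but treat that as an assumption.

```java
import java.net.MalformedURLException;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

class RemoteJmxSketch {
    /** Builds the JMX-over-RMI service URL for a host:port such as "127.0.0.1:9090". */
    static JMXServiceURL serviceUrl(String hostPort) throws MalformedURLException {
        return new JMXServiceURL("service:jmx:rmi:///jndi/rmi://" + hostPort + "/jmxrmi");
    }

    /** Connects as the RMI 'client' and asks the 'server' JVM how many MBeans it has. */
    static int countMBeans(String hostPort) throws Exception {
        try (JMXConnector connector = JMXConnectorFactory.connect(serviceUrl(hostPort))) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Every call on conn is a remote invocation; results are serialised
            // on the server and deserialised here.
            return conn.getMBeanCount();
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 1) {
            System.out.println("MBeans: " + countMBeans(args[0]));
        } else {
            System.out.println("usage: RemoteJmxSketch host:port");
        }
    }
}
```

This sketch passes no credentials or SSL settings, which matches the 
unauthenticated, ssl: false configuration shown earlier.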

Take socketProperties for example. I don't care about it; I care about 
currentThreadCount etc. But the problem with Tomcat (fixed in Tomcat 
8.5.35, if you have the luxury of moving to that; our vendor-supplied 
application bundles Tomcat 8.5.34) is that its implementation of the 
'getter' method for socketProperties returns something that is not 
serialisable (its class doesn't implement java.io.Serializable). This 
becomes a problem at the point where the value needs to be serialised, 
which is when it is sent back over RMI. This results in an exception.
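
The failure can be reproduced without Tomcat or RMI, since it comes from 
plain Java serialisation. The sketch below uses two made-up classes 
(NotMarked standing in for whatever socketProperties returns, and Marked 
for a well-behaved value) and shows that the getter itself is fine; the 
exception only appears when something tries to write the object out.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerialisationSketch {
    /** Stand-in for a returned object whose class is not Serializable. */
    static class NotMarked {
        int soTimeout = 20000;
    }

    /** Stand-in for a returned object whose class is Serializable. */
    static class Marked implements Serializable {
        private static final long serialVersionUID = 1L;
        int currentThreadCount = 10;
    }

    /** True if Java serialisation accepts the value, false on NotSerializableException. */
    static boolean canSerialise(Object value) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return true;
        } catch (NotSerializableException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Marked:    " + canSerialise(new Marked()));
        System.out.println("NotMarked: " + canSerialise(new NotMarked()));
    }
}
```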

Because jmx_exporter is using a method that says 'give me all the 
attributes for MBean B', that exception basically junks the whole result, 
and I lose the result of currentThreadCount etc. with it.

JConsole, on the other hand, uses the slower-but-steadier 'tell me what 
attributes exist in MBean B' followed by a lot of 'give me Attribute A for 
MBean B', so it can handle that exception for just the one attribute 
(showing it as a red 'Unavailable').
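
A minimal sketch of that attribute-at-a-time style, using only standard 
javax.management calls, might look like the following. This is my own 
illustration, not JConsole's or jmx_exporter's code. It runs in-process 
against a made-up FlakyPool MBean whose socketProperties getter throws, 
which is only a stand-in for the RMI serialisation failure described above.

```java
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

class PerAttributeSketch {
    public interface FlakyPoolMBean {
        int getCurrentThreadCount();
        int getCurrentThreadsBusy();
        Object getSocketProperties();
    }

    public static class FlakyPool implements FlakyPoolMBean {
        public int getCurrentThreadCount() { return 10; }
        public int getCurrentThreadsBusy() { return 2; }
        // Simulated failure; the real one happens during RMI serialisation.
        public Object getSocketProperties() { throw new IllegalStateException("simulated"); }
    }

    /** Reads each readable attribute on its own, skipping the ones that fail. */
    static Map<String, Object> readEach(MBeanServerConnection conn, ObjectName name) throws Exception {
        Map<String, Object> values = new LinkedHashMap<>();
        for (MBeanAttributeInfo info : conn.getMBeanInfo(name).getAttributes()) {
            if (!info.isReadable()) {
                continue;
            }
            try {
                values.put(info.getName(), conn.getAttribute(name, info.getName()));
            } catch (Exception e) {
                // One bad attribute no longer costs us the others. A real
                // implementation should tell a dead connection apart from this.
                System.err.println(info.getName() + " unavailable: " + e);
            }
        }
        return values;
    }

    /** Registers the demo MBean (once) and reads it attribute by attribute. */
    static Map<String, Object> demo() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("example.sketch:type=FlakyPool");
        if (!server.isRegistered(name)) {
            server.registerMBean(new FlakyPool(), name);
        }
        return readEach(server, name);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

Note that a standard MBean like this one names its attributes after the 
getters (CurrentThreadCount), whereas Tomcat's show up with a lower-case 
first letter.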


Now let's look at another similar case; one where there are no bugs 
present. In this example I want to get information about database 
connection pool utilisation because this is valuable information and a 
common load-related performance issue (this tends to be true of 
connection-pools in general, such as for LDAP, but you get plenty of 
third-party libraries in the JDBC space).

For this you'll need to find some suitable MBeans, assuming they are even 
visible at all. One of my studies had a Tomcat 6 deployment with 
PostgreSQL, and it didn't seem to expose any MBeans that I could see. My 
other study had Tomcat 7, and the MBeans lived in a domain specific to the 
application (in this case, an online learning product called Blackboard).

[image: 2020-06-05 20_40_34-RHEL Server 7 [Running] - Oracle VM 
VirtualBox.png]

Note that the ClassName is org.apache.tomcat.jdbc.pool.jmx.ConnectionPool 
.... but it's the application that decides where to put the MBean and what 
to use as the ObjectName, so if the application is managing its own 
connection pools (rather than using a connection-pool provided by the 
middleware), be prepared to hunt around for it. The ClassName does come 
into play though, because that tells us what data is inside the MBean (and 
helps us find some documentation as to what those attributes might 
actually mean).

So let's see what attributes this fairly common class exposes for 
monitoring: There are some obvious things here we would want to measure, 
either as GAUGES or as COUNTERS, but most of it we wouldn't need or want. 
In this screenshot, remember that I'm using Remote JMX, and JConsole is 
also using Remote JMX in this instance. If you hover over the red 
Unavailable for JdbcInterceptorsAsArray, you see the exception that causes 
it to be unavailable, and it's the same exception you see in the 
jmx_exporter (or more accurately, ./jmx_prometheus_httpserver.jar) when you 
have debug logging enabled.

[image: 2020-06-05 21_29_05-RHEL Server 7 [Running] - Oracle VM 
VirtualBox.png]

java.rmi.UnmarshalException: error unmarshalling return; nested exception 
is: 
java.lang.ClassNotFoundException: 
org.apache.tomcat.jdbc.pool.PoolProperties$InterceptorDefinition (no 
security manager: RMI class loader disabled)

Let's unpack this a bit to understand what this means: the RMI client 
(JConsole in Remote JMX mode, or jmx_prometheus_httpserver.jar) has 
received a serialised instance of a class called 
org.apache.tomcat.jdbc.pool.PoolProperties$InterceptorDefinition, and it 
needs to deserialise it to extract a value from it (eg. a string value or 
floating point value). But to do that it needs to have that class 
available on its own classpath. You can see that this class is specific to 
Tomcat and JDBC, so JConsole (or jmx_prometheus_httpserver.jar) is unlikely 
to have it available.
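
The same kind of failure can be simulated in a single JVM. In the sketch 
below (all names hypothetical), an object is serialised normally and then 
read back through an ObjectInputStream that pretends one class is missing, 
which is roughly the position the RMI client is in. It illustrates the 
mechanism only and does not involve RMI or Tomcat's classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.Serializable;

class MissingClassSketch {
    /** Made-up stand-in for a server-only class such as InterceptorDefinition. */
    static class ServerOnlyValue implements Serializable {
        private static final long serialVersionUID = 1L;
        String className = "example.Interceptor";
    }

    static byte[] serialise(Object value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    /** Deserialises as if the reader's classpath did not contain missingClassName. */
    static Object deserialiseWithout(byte[] data, String missingClassName)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data)) {
            @Override
            protected Class<?> resolveClass(ObjectStreamClass desc)
                    throws IOException, ClassNotFoundException {
                if (desc.getName().equals(missingClassName)) {
                    throw new ClassNotFoundException(desc.getName());
                }
                return super.resolveClass(desc);
            }
        }) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] data = serialise(new ServerOnlyValue());
        try {
            deserialiseWithout(data, ServerOnlyValue.class.getName());
        } catch (ClassNotFoundException e) {
            System.out.println("client cannot deserialise: " + e.getMessage());
        }
    }
}
```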

Presumably you could hunt around (a lot) and stuff a lot of things into the 
classpath of the RMI client, but that's painful, needless work (I tried and 
failed, but I'm not enough of a Tomcat wizard to know how to determine what 
classpath is present (classloaders, yay) for that webapp etc.).

Alternatively, I could apparently make use of 'RMI class loader' which 
sends the classes over the wire too to be loaded on the client side --- and 
also have to navigate a security manager --- that's a learning path I may 
have to attempt next.

Either way, I have no interest in JdbcInterceptorsAsArray anyway; all I 
want is Active, Idle, Size and a few counters that are critically 
important for my monitoring. But if I can't get a complete result set, I 
get nothing.


Let's recap and see how this affects the value I'm expecting to achieve:

1) provide telemetry we're missing to diagnose urgent and important 
production issues, particularly for database connection pools and thread 
counts (memory/garbage collection would also be useful in the general case, 
and application-specific MBeans that would be useful in specific cases, 
such as applications that use SOLR or particular frameworks that instrument 
various URL handlers with nice statistics)
2) impart minimal changes to application runtime or risk changing behaviour 
in mission-critical production application
3) impart minimal changes in performance; we don't want to induce 
unreasonable load by introducing monitoring.

#1 is mostly unattainable, either because something is not serialisable on 
the RMI server side, or because it cannot be deserialised on the RMI client 
side. All I can get are the 'nice-to-haves'.
#2 would be met by Remote JMX; if I have to use the Agent then my lead-time 
for introducing monitoring increases, decreasing my agility and ability to 
quickly withdraw the functionality in a production environment without an 
application restart.
#3 with appropriate whitelisting of ObjectNames we can get most of the way 
there and could reasonably scrape the metrics once a minute without fear, 
although some MBeans do become very large, particularly if they contain 
arrays, when you often only need a small handful of attributes. If we can 
scrape a smaller set however, we could achieve a higher fidelity if 
desired, which might paint a truer picture if all you have to work with are 
gauges.


I would like to propose that we introduce one of two things:

EITHER add a new configuration option, whitelistObjectNameAttributes, that 
could be used for JConsole-style attribute-at-a-time fetching (or similar; 
can you grab a few named attributes in one go?), which would allow for 
either the broad-brush or fine-brush approach to collecting the data;

OR allow for using the slower attribute-at-a-time as either an option or as 
a fallback.
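
On the 'few named attributes in one go' question: the standard JMX API does 
have MBeanServerConnection.getAttributes(ObjectName, String[]), which takes 
an explicit list of attribute names. Below is a minimal sketch of how a 
per-attribute whitelist could use it, run in-process against a made-up 
DemoJdbcPool MBean. How whitelistObjectNameAttributes would actually be 
wired into jmx_exporter is not shown, and all class names are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import javax.management.Attribute;
import javax.management.MBeanServer;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

class SubsetFetchSketch {
    public interface DemoJdbcPoolMBean {
        int getActive();
        int getIdle();
        int getSize();
        Object getJdbcInterceptorsAsArray();
    }

    public static class DemoJdbcPool implements DemoJdbcPoolMBean {
        /** Counts how often the troublesome getter is actually invoked. */
        static final AtomicInteger interceptorReads = new AtomicInteger();

        public int getActive() { return 1; }
        public int getIdle() { return 4; }
        public int getSize() { return 5; }
        public Object getJdbcInterceptorsAsArray() {
            interceptorReads.incrementAndGet();
            return new Object[0];
        }
    }

    /** Fetches only the named attributes, in a single getAttributes() call. */
    static Map<String, Object> fetch(MBeanServerConnection conn, ObjectName name,
                                     String... wanted) throws Exception {
        Map<String, Object> values = new LinkedHashMap<>();
        for (Attribute attribute : conn.getAttributes(name, wanted).asList()) {
            values.put(attribute.getName(), attribute.getValue());
        }
        return values;
    }

    /** Registers the demo MBean (once) and fetches a whitelisted subset. */
    static Map<String, Object> demo() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("example.sketch:type=DemoJdbcPool");
        if (!server.isRegistered(name)) {
            server.registerMBean(new DemoJdbcPool(), name);
        }
        return fetch(server, name, "Active", "Idle", "Size");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
        System.out.println("interceptor getter calls: " + DemoJdbcPool.interceptorReads.get());
    }
}
```

Because JdbcInterceptorsAsArray is never requested, its getter is never 
called and its value never has to cross the wire, which is the behaviour 
the first option is after while keeping to one round-trip per MBean.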

Personally I would prefer the first option because I would much rather pick 
and choose, since I need to be familiar with what data is available anyway 
in order to use it effectively.

I'm not a Java programmer (at all, but I am a bit of a polyglot and I've 
been supporting Java workloads for years) but I'd be willing to give a go 
at implementing this and submitting a pull-request if people would be 
interested in receiving one.


PS. If anyone would like an Ansible playbook for deploying 
jmx_prometheus_httpserver.jar I'm willing to share what I have so far.

PPS. If anyone has experience setting up RMI class loader, I'd love some 
tips.

Thanks for reading this far, and I hope this (long) post helps people to 
understand and use jmx_exporter more effectively. Once I complete some of 
this, you can expect some documentation-related PRs.

Cameron

 

To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/04481e73-465e-4815-a6a9-4697c4e930ceo%40googlegroups.com.
