sbourkeostk opened a new issue #7489:
URL: https://github.com/apache/pulsar/issues/7489
**Describe the bug**
When an exception occurs in a Pulsar function or connector, the timestamp of
the exception is included as a metric label and the gauge value is set to 1.0.
This has two drawbacks:
1. It causes the brokers' metrics scrapes to grow continuously - we witnessed
individual broker metrics scrapes of ~100 MB.
2. In Prometheus it results in a new time series (of length 1) for **every**
exception thrown - in extreme cases this will cause Prometheus to run out of
memory and fail - we experienced such a failure.
**To Reproduce**
Steps to reproduce the behavior:
1. Get a Pulsar function to throw multiple exceptions
2. Scrape the metrics for the broker involved
3. Look for pulsar_function_user_exception metrics
4. A gauge will exist for each exception thrown (see "screenshot" below)
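Steps 2 to 4 can be checked mechanically. The following is only a minimal sketch, assuming the scrape output has already been saved as lines of text; the class and method names are hypothetical and not part of Pulsar, and the sample labels are abbreviated.

```java
import java.util.List;

// Hypothetical helper, not part of Pulsar: counts the sample lines (one per
// time series) of a single metric in Prometheus text-format scrape output.
class UserExceptionSeriesCounter {

    static final String METRIC = "pulsar_function_user_exception";

    // Skips "# HELP" / "# TYPE" comment lines and counts lines whose metric
    // name is exactly METRIC (followed by a label block or a bare value).
    static long countSeries(List<String> scrapeLines) {
        return scrapeLines.stream()
                .map(String::trim)
                .filter(line -> !line.startsWith("#"))
                .filter(line -> line.startsWith(METRIC + "{") || line.startsWith(METRIC + " "))
                .count();
    }

    public static void main(String[] args) {
        // Labels abbreviated for readability; a real scrape carries more of them.
        List<String> scrape = List.of(
                "# HELP pulsar_function_user_exception Exception from user code.",
                "# TYPE pulsar_function_user_exception gauge",
                "pulsar_function_user_exception{name=\"Thrower\",error=\"Exception message\",ts=\"1594301177959\",} 1.0",
                "pulsar_function_user_exception{name=\"Thrower\",error=\"Exception message\",ts=\"1594301238033\",} 1.0");
        System.out.println(countSeries(scrape)); // prints 2
    }
}
```

If the count rises by one with every exception the function throws, the behavior described in this issue is present.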
**Expected behavior**
The timestamp should not be included as a metric label, and the metric should
be incremented for each exception thrown.
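To illustrate the difference, here is a minimal sketch that models each distinct label set as one time series. It is not Pulsar's `FunctionStatsManager` and does not use the Prometheus client library; the class `ExceptionMetricSketch` and its methods are hypothetical, and the timestamps are taken from the sample output in this report.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical model, not Pulsar code: each distinct label string stands for
// one Prometheus time series, and its LongAdder holds that series' value.
class ExceptionMetricSketch {

    private final Map<String, LongAdder> series = new ConcurrentHashMap<>();
    private final boolean includeTimestamp;

    ExceptionMetricSketch(boolean includeTimestamp) {
        this.includeTimestamp = includeTimestamp;
    }

    void recordException(String fqfn, String error, long tsMillis) {
        String labels = "fqfn=\"" + fqfn + "\",error=\"" + error + "\"";
        if (includeTimestamp) {
            // Current behavior as reported: the timestamp makes every label set unique.
            labels += ",ts=\"" + tsMillis + "\"";
        }
        series.computeIfAbsent(labels, k -> new LongAdder()).increment();
    }

    // Number of distinct label sets, i.e. time series exposed to a scrape.
    int seriesCount() {
        return series.size();
    }

    // Sum of all series values, i.e. total exceptions recorded.
    long total() {
        return series.values().stream().mapToLong(LongAdder::sum).sum();
    }

    public static void main(String[] args) {
        long[] timestamps = {1594301177959L, 1594301238033L, 1594301266054L,
                             1594301278993L, 1594301272472L};
        ExceptionMetricSketch withTs = new ExceptionMetricSketch(true);
        ExceptionMetricSketch withoutTs = new ExceptionMetricSketch(false);
        for (long ts : timestamps) {
            withTs.recordException("public/default/Thrower", "Exception message", ts);
            withoutTs.recordException("public/default/Thrower", "Exception message", ts);
        }
        System.out.println("with ts: " + withTs.seriesCount() + " series");
        System.out.println("without ts: " + withoutTs.seriesCount()
                + " series, value " + withoutTs.total());
    }
}
```

With the timestamp label, five exceptions produce five series that each hold the value 1; without it, they produce a single series whose value is 5, so the scrape size stays constant however many exceptions are thrown.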
**Screenshots**
```
# HELP pulsar_function_user_exception Exception from user code.
# TYPE pulsar_function_user_exception gauge
pulsar_function_user_exception{tenant="public",namespace="public/default",name="Thrower",instance_id="0",cluster="standalone",fqfn="public/default/Thrower",error="Exception message",ts="1594301177959",} 1.0
pulsar_function_user_exception{tenant="public",namespace="public/default",name="Thrower",instance_id="0",cluster="standalone",fqfn="public/default/Thrower",error="Exception message",ts="1594301238033",} 1.0
pulsar_function_user_exception{tenant="public",namespace="public/default",name="Thrower",instance_id="0",cluster="standalone",fqfn="public/default/Thrower",error="Exception message",ts="1594301266054",} 1.0
pulsar_function_user_exception{tenant="public",namespace="public/default",name="Thrower",instance_id="0",cluster="standalone",fqfn="public/default/Thrower",error="Exception message",ts="1594301278993",} 1.0
pulsar_function_user_exception{tenant="public",namespace="public/default",name="Thrower",instance_id="0",cluster="standalone",fqfn="public/default/Thrower",error="Exception message",ts="1594301272472",} 1.0
```
**Desktop (please complete the following information):**
OS: (NA)
**Additional context**
We experienced failures in our production system due to this issue. A single
node of a distributed database being down caused function exceptions - we were
prepared for this and it did not cause a problem. However, the resulting metrics
caused our Prometheus to fail, and our brokers had to be restarted to clear
their metrics data.
While this example is for a Pulsar function, the same is true for connectors.
https://github.com/apache/pulsar/blob/beb9e3be60513bdfbd0e412a68747b97714af1d7/pulsar-functions/instance/src/main/java/org/apache/pulsar/functions/instance/stats/FunctionStatsManager.java#L246