sbourkeostk opened a new issue #7489:
URL: https://github.com/apache/pulsar/issues/7489


   **Describe the bug**
   When an exception occurs in a Pulsar function or connector, the timestamp of 
the exception is included as a metric label and the gauge value is set to 1.0. 
This has several drawbacks:
   1. It causes the brokers' metrics scrapes to grow continuously - we witnessed 
individual broker metrics scrapes of ~100 MB.
   2. In Prometheus it results in a new time series (of length 1) for **every** 
exception thrown. In extreme cases this will cause Prometheus to run out of 
memory and fail - we experienced such a failure.
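
To make the growth concrete, here is a minimal plain-Java sketch that counts how many distinct label sets (and therefore Prometheus time series) a run of exceptions produces, with and without the `ts` label. The class and method names (`SeriesGrowth`, `distinctSeries`) are hypothetical and only illustrate the effect; this is not Pulsar code.

```java
import java.util.HashSet;
import java.util.Set;

public class SeriesGrowth {
    // Returns the number of distinct label sets produced by the given
    // exception timestamps. Each distinct label set is one time series.
    public static int distinctSeries(long[] timestampsMs, boolean includeTimestampLabel) {
        Set<String> labelSets = new HashSet<>();
        for (long ts : timestampsMs) {
            String labels = "fqfn=\"public/default/Thrower\",error=\"Exception message\"";
            if (includeTimestampLabel) {
                // The timestamp makes every exception's label set unique.
                labels += ",ts=\"" + ts + "\"";
            }
            labelSets.add(labels);
        }
        return labelSets.size();
    }
}
```

With the five timestamps from the scrape shown in this report, `distinctSeries(ts, true)` yields 5 series, while `distinctSeries(ts, false)` yields 1.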
   
   **To Reproduce**
   Steps to reproduce the behavior:
   1. Get a Pulsar function to throw multiple exceptions
   2. Scrape the metrics for the broker involved
   3. Look for pulsar_function_user_exception metrics
   4. A gauge will exist for each exception thrown (see "screenshot" below)
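
For step 1, a function along these lines should be enough. This is a sketch, assuming the function is deployed as a plain `java.util.function.Function` (which Pulsar Functions accepts); the class name `Thrower` and the message text are taken from the metric labels in the scrape below, not from our actual code.

```java
import java.util.function.Function;

// Always fails, so every input message produces one user exception.
public class Thrower implements Function<String, String> {
    @Override
    public String apply(String input) {
        throw new RuntimeException("Exception message");
    }
}
```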
   
   **Expected behavior**
   The timestamp is not included in the metric labels, and the metric is 
incremented for each exception thrown.
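
As a rough illustration of that behavior, the sketch below keeps one counter per label set and increments it on each exception, so the number of series stays fixed no matter how many exceptions occur. It uses only the JDK; `ExceptionCounter` and its methods are hypothetical names, and this is not the Prometheus client API nor the code in `FunctionStatsManager`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class ExceptionCounter {
    // One counter per distinct label set; the timestamp is deliberately not part of the key.
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    private static String key(String fqfn, String instanceId, String error) {
        return "fqfn=\"" + fqfn + "\",instance_id=\"" + instanceId + "\",error=\"" + error + "\"";
    }

    // Called once per exception thrown by user code.
    public void record(String fqfn, String instanceId, String error) {
        counts.computeIfAbsent(key(fqfn, instanceId, error), k -> new LongAdder()).increment();
    }

    // Number of exceptions recorded for this label set.
    public long total(String fqfn, String instanceId, String error) {
        LongAdder adder = counts.get(key(fqfn, instanceId, error));
        return adder == null ? 0L : adder.sum();
    }

    // Number of distinct series that a scrape would expose.
    public int seriesCount() {
        return counts.size();
    }
}
```

Recording five exceptions for `public/default/Thrower` with the same message leaves `seriesCount()` at 1 and `total(...)` at 5.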
   
   **Screenshots**
   ```
   # HELP pulsar_function_user_exception Exception from user code.
   # TYPE pulsar_function_user_exception gauge
   
   pulsar_function_user_exception{tenant="public",namespace="public/default",name="Thrower",instance_id="0",cluster="standalone",fqfn="public/default/Thrower",error="Exception message",ts="1594301177959",} 1.0
   pulsar_function_user_exception{tenant="public",namespace="public/default",name="Thrower",instance_id="0",cluster="standalone",fqfn="public/default/Thrower",error="Exception message",ts="1594301238033",} 1.0
   pulsar_function_user_exception{tenant="public",namespace="public/default",name="Thrower",instance_id="0",cluster="standalone",fqfn="public/default/Thrower",error="Exception message",ts="1594301266054",} 1.0
   pulsar_function_user_exception{tenant="public",namespace="public/default",name="Thrower",instance_id="0",cluster="standalone",fqfn="public/default/Thrower",error="Exception message",ts="1594301278993",} 1.0
   pulsar_function_user_exception{tenant="public",namespace="public/default",name="Thrower",instance_id="0",cluster="standalone",fqfn="public/default/Thrower",error="Exception message",ts="1594301272472",} 1.0
   ```
   
   **Desktop (please complete the following information):**
    OS: (NA)
   
   **Additional context**
   We experienced failures in our production system due to this issue. A single 
node of a distributed database being down caused function exceptions. We were 
prepared for this and it did not cause a problem. However, the resulting metrics 
caused our Prometheus to fail, and our brokers had to be restarted to clear 
their metrics data.
   While this example is for a Pulsar function, the same is true for connectors.
   
   
https://github.com/apache/pulsar/blob/beb9e3be60513bdfbd0e412a68747b97714af1d7/pulsar-functions/instance/src/main/java/org/apache/pulsar/functions/instance/stats/FunctionStatsManager.java#L246
   

