[jira] [Updated] (FLINK-24514) Incorrect Flink Metrics on Job-Internal-Restart:

Alok Singh (Jira) Tue, 12 Oct 2021 05:38:04 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-24514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Alok Singh updated FLINK-24514:
-------------------------------
    Description: 
After the Flink job restarts internally, our metrics report values several 
times higher than expected. These restarts are caused by internal exceptions; 
for example, during deployment the job did not get its Task Managers in time 
and then restarted on its own.

Metrics implementation:
 # We implemented the metrics using Meter.
 # Accumulators.scala defines each metric name as a Value. CustomMetrics.scala 
holds a Map that uses this Value as the key and a MeterView as the value.
 # Each MeterView is created from an instance of the AtomicLongCounter.scala 
class, which implements the Counter interface and overrides its methods. (The 
code files are attached for reference.)
 # We register the metrics inside FilterReportsForSummaryAnalysis.scala.
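
For readers without the attachments, the structure described above can be sketched roughly as follows. This is a minimal illustration, not the attached code. The Counter trait and MeterView class are simplified stand-ins for Flink's org.apache.flink.metrics.Counter and MeterView, so that the sketch compiles without Flink on the classpath (the real MeterView also computes a per-second rate). The metric names and the choice to make CustomMetrics a class are assumptions.

```scala
import java.util.concurrent.atomic.AtomicLong

// Stand-in for Flink's Counter interface (assumption: simplified copy).
trait Counter {
  def inc(): Unit
  def inc(n: Long): Unit
  def dec(): Unit
  def dec(n: Long): Unit
  def getCount: Long
}

// Assumed shape of AtomicLongCounter.scala: a thread-safe Counter.
class AtomicLongCounter extends Counter {
  private val value = new AtomicLong(0L)
  override def inc(): Unit = inc(1L)
  override def inc(n: Long): Unit = { value.addAndGet(n); () }
  override def dec(): Unit = dec(1L)
  override def dec(n: Long): Unit = { value.addAndGet(-n); () }
  override def getCount: Long = value.get()
}

// Stand-in for Flink's MeterView: only the event count is modelled here.
class MeterView(counter: Counter) {
  def markEvent(): Unit = counter.inc()
  def getCount: Long = counter.getCount
}

// Assumed shape of Accumulators.scala: metric names as Enumeration values.
// ReportsSeen and ReportsFiltered are hypothetical names.
object Accumulators extends Enumeration {
  val ReportsSeen, ReportsFiltered = Value
}

// Assumed shape of CustomMetrics.scala: one MeterView per metric name.
class CustomMetrics {
  val meters: Map[Accumulators.Value, MeterView] =
    Accumulators.values.toSeq
      .map(name => name -> new MeterView(new AtomicLongCounter))
      .toMap
}

object MetricsSketch {
  def main(args: Array[String]): Unit = {
    val metrics = new CustomMetrics
    metrics.meters(Accumulators.ReportsSeen).markEvent()
    metrics.meters(Accumulators.ReportsSeen).markEvent()
    println(metrics.meters(Accumulators.ReportsSeen).getCount) // prints 2
  }
}
```

In a real job, each MeterView would additionally be registered on the operator's metric group, typically in the open() method of a rich function.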

Some points to remember:
 # Not all internal job restarts cause incorrect metrics.
 # When an internal job restart has caused incorrect metrics and we then 
restart the job manually (killing it and restarting it with or without a 
savepoint), the metrics show correct values again, provided that no further 
exception triggers another internal restart.
 # We are using Flink Delay Restart Strategy.
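
The pattern above (values multiply after an internal restart but are correct after a manual one) is what one would observe if a counter outlived a single task attempt. The sketch below only illustrates that general mechanism and is not a confirmed root cause for this issue. SharedRegistry, TaskAttempt and RestartSketch are hypothetical names, and no Flink API is used. It compares a counter held in a JVM-wide Scala object with a counter created per attempt, when the same records are replayed after a simulated restart.

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical JVM-wide holder: a Scala `object` is initialised once per
// class loader, so its state can outlive an individual task attempt.
object SharedRegistry {
  val processed = new AtomicLong(0L)
}

// Hypothetical task attempt that counts the records it processes.
class TaskAttempt(useShared: Boolean) {
  // Either reuse the JVM-wide counter or create a fresh one for this attempt.
  private val counter: AtomicLong =
    if (useShared) SharedRegistry.processed else new AtomicLong(0L)

  def run(records: Int): Long = {
    (1 to records).foreach(_ => counter.incrementAndGet())
    counter.get()
  }
}

object RestartSketch {
  // Runs two attempts over the same records and returns the count that the
  // second attempt reports.
  def countAfterRestart(useShared: Boolean, records: Int): Long = {
    SharedRegistry.processed.set(0L)        // model a freshly started JVM
    new TaskAttempt(useShared).run(records) // first attempt, which then fails
    new TaskAttempt(useShared).run(records) // internal restart replays the records
  }

  def main(args: Array[String]): Unit = {
    println(countAfterRestart(useShared = true, records = 100))  // prints 200
    println(countAfterRestart(useShared = false, records = 100)) // prints 100
  }
}
```

Whether the attached code keeps its counters in such a shared holder cannot be determined from this description alone.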

*We need the root cause of the incorrect Flink metrics on internal job restarts 
identified and the issue resolved.*


> Incorrect Flink Metrics on Job-Internal-Restart:
> ------------------------------------------------
>
>                 Key: FLINK-24514
>                 URL: https://issues.apache.org/jira/browse/FLINK-24514
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Metrics
>    Affects Versions: 1.12.1
>            Reporter: Alok Singh
>            Priority: Major
>         Attachments: Screenshot 2021-10-12 at 4.46.49 PM.png, Screenshot 
> 2021-10-12 at 4.47.17 PM.png, Screenshot 2021-10-12 at 4.47.29 PM.png, 
> Screenshot 2021-10-12 at 4.47.41 PM.png



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
