Re: [PR] Fix issue creation for grafana alerts [beam]

via GitHub Thu, 29 Feb 2024 06:26:35 -0800


damccorm commented on code in PR #30424:
URL: https://github.com/apache/beam/pull/30424#discussion_r1507673562



##########
.test-infra/metrics/sync/github/github_runs_prefetcher/code/main.py:
##########
@@ -143,6 +144,41 @@ def enhance_workflow(workflow):
         print(f"No yaml file found for workflow: {workflow.name}")
 
 
+async def check_workflow_flakiness(workflow):
+    def filter_workflow_runs(run, issue):
+        started_at = datetime.strptime(run.started_at, "%Y-%m-%dT%H:%M:%SZ")
+        closed_at = datetime.strptime(issue["closed_at"], "%Y-%m-%dT%H:%M:%SZ")
+        if started_at > closed_at:
+            return True
+        return False
+
+    if not len(workflow.runs):
+        return False
+
+    url = f"https://api.github.com/repos/{GIT_ORG}/beam/issues"
+    headers = {"Authorization": get_token()}
+    semaphore = asyncio.Semaphore(5)
+    workflow_runs = workflow.runs
+    params = {
+        "state": "closed",
+        "labels": f"flaky_test,workflow_id: {workflow.id}",
+    }
+    response = await fetch(url, semaphore, params, headers)
+    if len(response):
+        print(f"Found a recently closed issue for the {workflow.name} workflow")
+        workflow_runs = [run for run in workflow_runs if filter_workflow_runs(run, response[0])]
+
+    print(f"Number of workflow runs to consider: {len(workflow_runs)}")
+    success_rate = 1.0
+    if len(workflow_runs):
+        failed_runs = list(filter(lambda r: r.status == "failure", workflow_runs))
+        print(f"Number of failed workflow runs: {len(failed_runs)}")
+        success_rate -= (len(failed_runs) / len(workflow_runs))
+
+    print(f"Success rate: {success_rate}")
+    return True if success_rate < workflow.threshold else False

Review Comment:
   I think I'm more worried about the case where:
   
   1) A change to fix the workflow goes in, but not all pieces are fully deployed yet, so the fix isn't instantaneous. Whether this happens depends on the workflow.
   2) The runs then go f, s, s, s, s, ...
   
   Right now, this would fire immediately after the first failure. On the other hand, if the runs go f, f, f, f, f, ..., I'd rather fire after the second or third failure instead of waiting 5 or 10 runs. Waiting for 2 failures (vs. 3) may be enough, but either way, gating on the number of failures instead of the number of runs lets us fire a little faster.
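
The gating idea above could be sketched roughly as follows. This is only an illustration under assumptions, not the PR's code: `should_alert`, `min_failures`, and the plain list of status strings are hypothetical names standing in for `workflow.runs` and `workflow.threshold`.

```python
from typing import Sequence


def should_alert(
    statuses: Sequence[str],
    threshold: float,
    min_failures: int = 2,  # hypothetical knob: 2 or 3, per the discussion above
) -> bool:
    """Decide whether to fire, gating on the number of failures.

    `statuses` holds one string per considered run, e.g. "success" or
    "failure" (an assumed input shape, not the PR's run objects).
    """
    if not statuses:
        return False

    failures = sum(1 for status in statuses if status == "failure")

    # Gate on failures rather than on total runs: a single failure right
    # after a fix (f, s, s, s, ...) never fires, while repeated failures
    # (f, f, ...) fire as soon as the gate is reached.
    if failures < min_failures:
        return False

    success_rate = 1.0 - failures / len(statuses)
    return success_rate < threshold
```

With a threshold of 0.95, the sequence f, s, s, s, s does not fire because only one failure has been seen, whereas f, f fires on the second run without waiting for 5 or 10 runs to accumulate.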



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

