[ https://issues.apache.org/jira/browse/GOBBLIN-1800?focusedWorklogId=858763&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-858763 ]
ASF GitHub Bot logged work on GOBBLIN-1800:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 24/Apr/23 18:33
Start Date: 24/Apr/23 18:33
Worklog Time Spent: 10m
Work Description: umustafi commented on code in PR #3661:
URL: https://github.com/apache/gobblin/pull/3661#discussion_r1175638739
##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/DagManager.java:
##########
@@ -681,18 +681,20 @@ private void cancelDag(DagId dagId) throws ExecutionException, InterruptedExcept
 clearUpDagAction(dagId);
 }
- private void cancelDagNode(DagNode<JobExecutionPlan> dagNodeToCancel) throws ExecutionException, InterruptedException {
+ private void cancelDagNode(DagNode<JobExecutionPlan> dagNodeToCancel, boolean shouldSendCancellationEvent) throws ExecutionException, InterruptedException {
Review Comment:
Can you add a comment explaining why you added this boolean, and in which
cases we don't send a cancellation event even though this function is called?
##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/DagManager.java:
##########
@@ -853,27 +852,40 @@ private boolean killJobIfOrphaned(DagNode<JobExecutionPlan> node, JobStatus jobS
 DagManagerUtils.getFullyQualifiedDagName(node), timeOutForJobStart);
 dagManagerMetrics.incrementCountsStartSlaExceeded(node);
- cancelDagNode(node);
 String dagId = DagManagerUtils.generateDagId(node).toString();
- this.dags.get(dagId).setFlowEvent(TimingEvent.FlowTimings.FLOW_START_DEADLINE_EXCEEDED);
+ this.dags.get(dagId).setFlowEvent(TimingEvent.FlowTimings.FLOW_CANCELLED);
Review Comment:
We're not calling `cancelDagNode` here, but we're still updating the flow status as well.
Issue Time Tracking
-------------------
Worklog Id: (was: 858763)
Time Spent: 1h (was: 50m)
> GaaS does not retry SLA killed jobs
> -----------------------------------
>
> Key: GOBBLIN-1800
> URL: https://issues.apache.org/jira/browse/GOBBLIN-1800
> Project: Apache Gobblin
> Issue Type: Bug
> Components: gobblin-service
> Reporter: William Lo
> Assignee: Abhishek Tiwari
> Priority: Major
> Time Spent: 1h
> Remaining Estimate: 0h
>
> Gobblin-as-a-Service fails jobs when they run past their start SLA or their
> runtime SLA. Jobs killed for exceeding these SLAs would be expected to be
> retried if they are configured to retry, but currently they are not.
> The DagManager should automatically retry jobs that exceed their SLAs if the
> user configured retries, in case these flow failures are due to intermittent
> issues.
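
The retry behavior requested in the issue description above can be sketched
roughly as follows. This is only an illustration under stated assumptions, not
the DagManager implementation: the class SlaRetrySketch, the enum Outcome, and
the parameters currentAttempts and maxAttempts are hypothetical names
introduced for this example and do not come from the Gobblin code base.

```java
// Hypothetical sketch: none of these names exist in Apache Gobblin.
final class SlaRetrySketch {

    /** Why a job stopped running (illustrative values only). */
    enum Outcome { SUCCEEDED, FAILED, START_SLA_EXCEEDED, RUN_SLA_EXCEEDED, CANCELLED_BY_USER }

    /**
     * Returns true when the job should be resubmitted: it ended for a
     * retryable reason (including either SLA being exceeded) and it still
     * has attempts left under the user-configured limit.
     */
    static boolean shouldRetry(Outcome outcome, int currentAttempts, int maxAttempts) {
        boolean retryable = outcome == Outcome.FAILED
            || outcome == Outcome.START_SLA_EXCEEDED
            || outcome == Outcome.RUN_SLA_EXCEEDED;
        return retryable && currentAttempts < maxAttempts;
    }
}
```

Under these assumptions, an SLA kill is treated like any other retryable
failure, while an explicit user cancellation is never retried.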
--
This message was sent by Atlassian Jira
(v8.20.10#820010)