[ https://issues.apache.org/jira/browse/AIRFLOW-2706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535321#comment-16535321 ]
ASF subversion and git services commented on AIRFLOW-2706:
----------------------------------------------------------
Commit 0c5ebcbd1e1b26664061f2db889748f0085d02fe in incubator-airflow's branch
refs/heads/master from [~cforster]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=0c5ebcb ]
[AIRFLOW-2706] AWS Batch Operator should use top-level job state to determine
status
Rather than inspecting the state of job attempts,
the operator should use the top-level job status
to determine the overall success or failure of the
task. This means the following cases are handled
correctly:
1. Any infrastructure failure that results in no
attempts being performed is now detected.
2. Any retry policy configured in AWS Batch is now
honored -- the job isn't marked FAILED until all
retry attempts have failed. Previously, the
first failed *attempt* would mark the task as
failed.
Closes #3567 from craigforster/master
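
The approach described in the commit message can be sketched roughly as
follows. This is a minimal illustration, not the operator's actual code:
the function name check_job_success, the BatchJobError exception and the
PENDING_STATES constant are hypothetical, and the function takes an
already-fetched describe_jobs-style response dict rather than calling AWS.

```python
# Hypothetical sketch: decide success from the top-level job status only.
# Names here are illustrative and are not taken from the Airflow source.

# AWS Batch job states that mean the job has not finished yet.
PENDING_STATES = {"SUBMITTED", "PENDING", "RUNNABLE", "STARTING", "RUNNING"}


class BatchJobError(RuntimeError):
    """Stand-in for the exception an operator would raise on failure."""


def check_job_success(response):
    """Raise BatchJobError unless every job in the response has finished OK.

    `response` is assumed to have the shape shown in the issue's log output:
    a dict with a "jobs" list whose items carry a top-level "status".
    """
    jobs = response.get("jobs", [])
    if not jobs:
        raise BatchJobError("No job found in response")

    for job in jobs:
        status = job["status"]
        if status == "FAILED":
            # The attempts list is deliberately ignored, so a job that
            # failed before any attempt was made is still detected.
            reason = job.get("statusReason", "no reason given")
            raise BatchJobError(
                "Job {} failed: {}".format(job.get("jobId"), reason)
            )
        if status in PENDING_STATES:
            raise BatchJobError(
                "Job {} is still pending: {}".format(job.get("jobId"), status)
            )
```

Because only the final top-level status is consulted, a job that AWS Batch
is still retrying is not reported as FAILED until Batch itself gives up.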
> AWS Batch Operator doesn't detect failure if there were no job attempts
> -----------------------------------------------------------------------
>
> Key: AIRFLOW-2706
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2706
> Project: Apache Airflow
> Issue Type: Bug
> Components: aws
> Reporter: Craig Forster
> Assignee: Craig Forster
> Priority: Major
> Fix For: 2.0.0
>
>
> During initial deployment testing of our AWS Batch environment using Airflow
> to co-ordinate, we had a few false starts while we fixed IAM roles. However,
> these failed jobs weren't detected as failed by Airflow.
> I believe the issue lies in _check_success_task; the failure check loops over
> the attempts array, but in this case there are no attempts to check.
> Logs:
> {noformat}
> {awsbatch_operator.py:150} INFO - AWS Batch stopped, check status:
> {
> "ResponseMetadata": {
> "RequestId": "51084897-7d90-11e8-be75-7b511f9b010d",
> "HTTPStatusCode": 200,
> "HTTPHeaders": {
> "date": "Mon, 02 Jul 2018 00:39:02 GMT",
> "content-type": "application/json",
> "content-length": "1142",
> "connection": "keep-alive",
> "x-amzn-requestid": "51084897-7d90-11e8-be75-7b511f9b010d",
> "x-amz-apigw-id": "JX8V_HOyPHcF5KA=",
> "x-amzn-trace-id": "Root=1-5b397426-058a6d1ce4d7569273c05bd4"
> },
> "RetryAttempts": 0
> },
> "jobs": [
> {
> "jobName": "snip-20180317",
> "jobId": "2ea0def8-1e7f-4a5c-bd1e-3f0a3acc035c",
> "jobQueue":
> "arn:aws:batch:us-west-2:snip:job-queue/snip-829f351459741d3",
> "status": "FAILED",
> "attempts": [],
> "statusReason": "Role is not valid",
> "createdAt": 1530491934164,
> "retryStrategy": { "attempts": 1 },
> "dependsOn": [],
> "jobDefinition":
> "arn:aws:batch:us-west-2:snip:job-definition/snip-job-definition:4",
> "parameters": {},
> "container": {
> "image":
> "snip.dkr.ecr.eu-central-1.amazonaws.com/snip:latest",
> "vcpus": 1,
> "memory": 2048,
> "command": [],
> "jobRoleArn":
> "arn:aws:iam::snip:instance-profile/common-instance-profile-us2-sandbox",
> "volumes": [],
> "environment": [
> { SNIP }
> ],
> "mountPoints": [],
> "ulimits": [],
> "privileged": True
> }
> }
> ]
> }
> {awsbatch_operator.py:110} INFO - AWS Batch Job has been successfully
> executed:
> {
> "ResponseMetadata": {
> "RequestId": "4c255dd7-7d90-11e8-988b-c9ea0b25c469",
> "HTTPStatusCode": 200,
> "HTTPHeaders": {
> "date": "Mon, 02 Jul 2018 00:38:54 GMT",
> "content-type": "application/json",
> "content-length": "111",
> "connection": "keep-alive",
> "x-amzn-requestid": "4c255dd7-7d90-11e8-988b-c9ea0b25c469",
> "x-amz-apigw-id": "JX8UtH6VvHcFcVg=",
> "x-amzn-trace-id": "Root=1-5b39741e-577ea13c82751664daac335e"
> },
> "RetryAttempts": 0
> },
> "jobName": "snip-20180317",
> "jobId": "2ea0def8-1e7f-4a5c-bd1e-3f0a3acc035c"
> }
> {noformat}
>
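
To illustrate the report above: a check that only walks the attempts list
has nothing to inspect when that list is empty, so it never signals a
failure. The snippet below is a simplified stand-in for such a loop, not
the code in awsbatch_operator.py; attempts_report_failure is a hypothetical
name, and the job dict is trimmed from the logged response.

```python
def attempts_report_failure(job):
    """Attempt-based check: True only if some attempt exited non-zero."""
    return any(
        attempt.get("container", {}).get("exitCode", 0) != 0
        for attempt in job.get("attempts", [])
    )


# Fields copied from the logged describe_jobs response in the issue.
job = {
    "status": "FAILED",
    "attempts": [],
    "statusReason": "Role is not valid",
}

# No attempts means the loop body never runs, so no failure is seen.
print(attempts_report_failure(job))   # False
# The top-level status still carries the real outcome.
print(job["status"] == "FAILED")      # True
```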
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)