cliebBS opened a new issue, #35563:
URL: https://github.com/apache/airflow/issues/35563
### Apache Airflow version
2.7.3
### What happened
When using the `RdsDeleteDbInstanceOperator` to delete an RDS instance in my
DAG, sometimes it misses the fact that the instance was deleted and leaves logs
like:
```
[2023-11-09, 19:32:14 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is: deleting
[2023-11-09, 19:32:44 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is: deleting
[2023-11-09, 19:33:14 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is: deleting
[2023-11-09, 19:33:44 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is:
[2023-11-09, 19:34:14 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is:
[2023-11-09, 19:34:44 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is:
[2023-11-09, 19:35:14 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is:
[2023-11-09, 19:35:44 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is:
[2023-11-09, 19:36:14 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is:
```
In the above example, at some point between 19:33:14 and 19:33:44, the
actual instance was deleted, but the operator doesn't realize that it was
deleted and instead continues to poll for the status of the RDS instance until
it reaches the `waiter_max_attempts`, at which point it fails. Retries of the
operator exit immediately with the log message:
```
[2023-11-09, 19:57:11 UTC] {taskinstance.py:1937} ERROR - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 2] Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/amazon/aws/operators/rds.py", line 690, in execute
    delete_db_instance = self.hook.conn.delete_db_instance(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/botocore/client.py", line 535, in _api_call
    return self._make_api_call(operation_name, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/botocore/client.py", line 980, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the DeleteDBInstance operation: User: arn:aws:sts::account-num:assumed-role/an-iam-role/botocore-session is not authorized to perform: rds:DeleteDBInstance on resource: arn:aws:rds:us-east-1:account-num:db:test-db because no identity-based policy allows the rds:DeleteDBInstance action
```
I _think_ what's happening is that the deletion completes in the window between when one waiter check times out and when the next one runs. Since the RDS delete operation doesn't leave any trace of the DB in the AWS API (unlike a terminated EMR cluster, for example), running the waiter again leaves the operator in a weird state where it thinks the resource still exists but can never get a status for it.
### What you think should happen instead
The `RdsDeleteDbInstanceOperator` should reliably finish in the success state once the RDS instance has been deleted, regardless of when the deletion completes relative to the waiter's polling.
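Relatedly, the retry path could be made safe by treating a missing instance as an already-completed delete before calling `DeleteDBInstance` again. This is only a sketch of the idea; `delete_db_instance_idempotent` and its signature are hypothetical, not the provider's code:

```python
def delete_db_instance_idempotent(client, db_instance_identifier: str,
                                  **delete_kwargs) -> bool:
    """Delete an RDS DB instance, tolerating 'already deleted'.

    Returns True if a deletion was issued, False if the instance was
    already gone, so an operator retry after an unnoticed successful
    deletion becomes a no-op instead of an error. `client` is a boto3
    RDS client (or a stand-in exposing the same interface).
    """
    try:
        client.delete_db_instance(
            DBInstanceIdentifier=db_instance_identifier, **delete_kwargs
        )
    except client.exceptions.DBInstanceNotFoundFault:
        return False  # instance no longer exists: treat the delete as done
    return True
```

A second invocation with the same identifier then returns `False` instead of raising, which is what a retrying task instance needs.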
### How to reproduce
```python
import datetime
from typing import Any

from airflow import DAG
from airflow.decorators import task
from airflow.providers.amazon.aws.operators.rds import (
    RdsCreateDbInstanceOperator,
    RdsDeleteDbInstanceOperator,
)

NUM_INSTANCES = 6  # set this higher to make it easier to reproduce in a single run

# TODO: set these appropriately for your environment
INSTANCE_PREFIX = "rds-delete-repro"
# I don't think we can make this any smaller
STORAGE_SIZE = 20
INSTANCE_TYPE = "db.t4g.micro"


@task()
def get_names(num_instances: int) -> list[str]:
    return [f"{INSTANCE_PREFIX}-{x}" for x in range(num_instances)]


@task()
def get_rds_kwargs(names: list[str]) -> list[dict[str, Any]]:
    rds_kwargs: dict[str, Any] = {
        "EngineVersion": "14.8",
        "DBParameterGroupName": "default.postgres14",
        "MasterUsername": "postgres",
        "MasterUserPassword": "postgres",
        "StorageType": "gp3",
        "AllocatedStorage": STORAGE_SIZE,
        "Port": 5432,
        "NetworkType": "IPV4",
        "MultiAZ": False,
        "BackupRetentionPeriod": 0,  # disables automatic backup.
        "PubliclyAccessible": False,
        "AutoMinorVersionUpgrade": False,
        "DeletionProtection": False,
        "StorageEncrypted": True,
        # TODO: set these to valid values for your AWS account
        # "VpcSecurityGroupIds": ["sg-0ced89fe13d1ae49a"],
        # "DBSubnetGroupName": "dbsubnet-rds-dev-eks",
    }
    return [{"db_instance_identifier": name, "rds_kwargs": rds_kwargs} for name in names]


with DAG(
    dag_id="rds_delete_repro",
    default_args={"retries": 1},
    start_date=datetime.datetime(2021, 1, 1),
    catchup=False,
    schedule_interval=None,
) as dag:
    db_instance_ids = get_names(NUM_INSTANCES)
    rds_kwargs = get_rds_kwargs(names=db_instance_ids)
    create_rds = RdsCreateDbInstanceOperator.partial(
        task_id="create_rds",
        db_instance_class=INSTANCE_TYPE,
        engine="postgres",
    ).expand_kwargs(rds_kwargs)
    delete_rds = RdsDeleteDbInstanceOperator.partial(
        task_id="delete_rds",
        rds_kwargs={
            "SkipFinalSnapshot": True,
            "DeleteAutomatedBackups": True,
        },
    ).expand(
        db_instance_identifier=db_instance_ids,
    )
    rds_kwargs >> create_rds >> delete_rds
```
You can increase the number of RDS instances this will spin up at once to
increase the odds that you'll trigger this problem.
### Operating System
Official Docker image
### Versions of Apache Airflow Providers
```
apache-airflow-providers-amazon==8.10.0
```
### Deployment
Docker-Compose
### Deployment details
MacOS Sonoma 14.0
```
❯ docker version
Client: Docker Engine - Community
 Version:           24.0.7
 API version:       1.42 (downgraded from 1.43)
 Go version:        go1.21.3
 Git commit:        afdd53b4e3
 Built:             Thu Oct 26 07:06:42 2023
 OS/Arch:           darwin/arm64
 Context:           colima

Server:
 Engine:
  Version:          23.0.6
  API version:      1.42 (minimum version 1.12)
  Go version:       go1.20.4
  Git commit:       9dbdbd4b6d7681bd18c897a6ba0376073c2a72ff
  Built:            Fri May 12 13:54:36 2023
  OS/Arch:          linux/arm64
  Experimental:     true
 containerd:
  Version:          v1.7.0
  GitCommit:        1fbd70374134b891f97ce19c70b6e50c7b9f4e0d
 runc:
  Version:          1.1.7
  GitCommit:        860f061b76bb4fc671f0f9e900f7d80ff93d4eb7
 docker-init:
  Version:          0.19.0
  GitCommit:
```
```
❯ docker-compose version
Docker Compose version 2.23.0
```
```
❯ colima version
colima version 0.5.6
git commit: ceef812c32ab74a49df9f270e048e5dced85f932
runtime: docker
arch: aarch64
client: v24.0.7
server: v23.0.6
```
```
❯ limactl --version
limactl version 0.18.0
```
```
❯ qemu-img --version
qemu-img version 8.1.2
Copyright (c) 2003-2023 Fabrice Bellard and the QEMU Project developers
```
### Anything else
_No response_
### Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)