SameerMesiah97 opened a new issue, #62301:
URL: https://github.com/apache/airflow/issues/62301
### Apache Airflow Provider(s)
google
### Versions of Apache Airflow Providers
`apache-airflow-providers-google>=20.0.0rc1`
### Apache Airflow version
main
### Operating System
Debian GNU/Linux 12 (bookworm)
### Deployment
Other
### Deployment details
_No response_
### What happened
When using `GKECreateClusterOperator`, a GKE cluster may be successfully
created even when the GCP service account has partial GKE permissions, for
example lacking `container.operations.get`.
In this scenario, the operator successfully calls `create_cluster` and the
GKE cluster begins provisioning in GCP. However, subsequent steps—such as
polling the operation in non-deferrable mode—fail due to insufficient
permissions.
The Airflow task then fails, but the GKE cluster continues provisioning or
remains active in GCP, resulting in leaked infrastructure and ongoing cost.
This can occur, for example, when the service account allows
`container.clusters.create` but explicitly denies `container.operations.get`,
which is required to monitor the long-running operation.
### What you think should happen instead
If the operator fails after successfully initiating cluster creation (for
example due to missing `container.operations.get` or other follow-up
permissions), it should make a best-effort attempt to clean up the partially
created resource by deleting the cluster.
Cleanup should be attempted opportunistically (i.e. only if the cluster name
is known and deletion permissions are available), and failure to clean up
should not mask or replace the original exception.
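The proposed behaviour can be sketched as follows. This is a hypothetical illustration, not the actual operator code: the hook methods (`create_cluster`, `wait_for_operation`, `delete_cluster`) and the `FakeGKEHook` stub are assumptions standing in for the real GKE hook API, with the stub simulating partially scoped permissions (create succeeds, polling raises).

```python
class FakeGKEHook:
    """Stub simulating partial IAM permissions: create works, polling fails.
    Hypothetical; stands in for the real GKE hook."""

    def __init__(self):
        self.deleted = False

    def create_cluster(self, body):
        # Succeeds: the service account has container.clusters.create.
        return "operation-123"

    def wait_for_operation(self, operation_name):
        # Fails: the service account lacks container.operations.get.
        raise PermissionError('Required "container.operations.get" permission(s)')

    def delete_cluster(self, name):
        self.deleted = True


def create_with_cleanup(hook, body):
    """Create a cluster, and on any post-create failure attempt best-effort
    deletion without masking the original exception."""
    operation = hook.create_cluster(body)  # cluster starts provisioning here
    try:
        hook.wait_for_operation(operation)
    except Exception as poll_err:
        # Opportunistic cleanup: the cluster name is known at this point.
        # A cleanup failure must never replace the original error.
        try:
            hook.delete_cluster(body["name"])
        except Exception:
            pass
        raise poll_err
```

With the stub above, the original `PermissionError` still propagates to fail the task, but the partially created cluster is deleted instead of leaking.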
### How to reproduce
1. Create a custom IAM role that allows `container.clusters.create` and
denies/omits `container.operations.get`
2. Create a service account and attach this custom role.
3. Create a GCP connection in Airflow using this service account.
(For example: `gcp_cloud_default`.)
4. Use the following DAG:
(Please replace `<PROJECT_ID>` and `<REGION>`
with your GCP project ID and a valid region, respectively.)
```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.kubernetes_engine import (
    GKECreateClusterOperator,
)

with DAG(
    dag_id="gke_partial_auth_cluster_leak_repro",
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    create_cluster = GKECreateClusterOperator(
        task_id="create_gke_cluster",
        project_id="<PROJECT_ID>",
        location="<REGION>",
        body={
            "name": "leaky-gke-cluster",
            "initial_node_count": 1,
        },
        gcp_conn_id="gcp_cloud_default",
        deferrable=False,  # triggers polling via operations.get
    )
```
5. Trigger the DAG.
**Observed Behaviour**
The task fails with:
`PermissionDenied: Required "container.operations.get" permission(s)`
However, the GKE cluster continues to provision in the background.
### Anything else
GKE clusters begin provisioning immediately once creation is initiated. Even
if the Airflow task fails shortly after, the cluster may continue creating and
eventually become active.
When failures occur after a successful create call (for example, due to
partially scoped IAM permissions), leaked clusters can result in unnecessary
cost and manual cleanup effort. This pattern is not novel in Airflow. Similar
behaviour has been accepted in AWS resource-creation operators, for example
with Amazon Redshift cluster creation (see PR #61333), where infrastructure can
be created successfully but leak if subsequent steps fail. Aligning the GKE
operator with a best-effort cleanup approach would therefore not set a new
behavioural precedent; it would bring the operator in line with existing
provider patterns.
**Relying solely on teardown tasks is not sufficient, as that shifts
responsibility for preventing resource leaks onto DAG authors**. Operators that
create infrastructure should make reasonable best-effort attempts to clean up
resources they successfully create, even if later steps fail.
While the GKE API does not always accept deletion requests during
`PROVISIONING`, that limitation does not preclude best-effort cleanup logic
(e.g. retrying deletion or attempting deletion once the cluster becomes
deletable).
### Are you willing to submit PR?
- [x] Yes I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)