moritzsanne opened a new issue, #38959: URL: https://github.com/apache/airflow/issues/38959
### Apache Airflow Provider(s)

amazon

### Versions of Apache Airflow Providers

[apache-airflow-providers-amazon](https://airflow.apache.org/docs/apache-airflow-providers-amazon/8.19.0) 8.19.0

### Apache Airflow version

2.8.3

### Operating System

Amazon Linux 2; kernel version 5.10.209-198.812.amzn2.x86_64

### Deployment

Official Apache Airflow Helm Chart

### Deployment details

We deploy Airflow on EKS using the official Helm chart.

### What happened

We deploy a Glue job using the `GlueJobOperator` with the following configuration:

```
GlueJobOperator(
    job_name="weather_data_prepared_local_file",
    script_location=str(Path(__file__).resolve().parent / "scripts/weather_data_prepared.py"),
    s3_bucket="aws-glue-temporary-bucket",
    task_id="WeatherGlueJob",
    iam_role_name="eks.data.airflow.glue.executor",
    create_job_kwargs={"GlueVersion": "4.0", "NumberOfWorkers": 2, "WorkerType": "G.1X"},
    update_config=True,
    aws_conn_id="datalake",
    dag=dag,
)
```

This works fine for the first run of our DAG, and the script file gets uploaded to `artifacts/glue-scripts/weather_data_prepared.py`. However, when we trigger the DAG a second time, the task fails because the key already exists (paths and line numbers in the traceback below are truncated as captured):

```
[2024-04-12T08:13:33.615+0000] {glue.py:173} INFO - Initializing AWS Glue Job: weather_data_prepared
[2024-04-12T08:13:33.659+0000] {base.py:83} INFO - Using connection ID 'datalake' for task execution
[2024-04-12T08:13:34.265+0000] {taskinstance.py:2731} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 44
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 41
    return execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/amazon/aws/operators/g
    glue_job_run = self.glue_job_hook.initialize_job(self.script_args, self.run_job_kwargs)
  File "/usr/local/lib/python3.8/functools.py", line 967, in __get__
    val = self.func(instance)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/amazon/aws/operators/g
    s3_hook.load_file(
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/amazon/aws/hooks/s3.py
    return func(*bound_args.args, **bound_args.kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/amazon/aws/hooks/s3.py
    return func(*bound_args.args, **bound_args.kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/amazon/aws/hooks/s3.py
    raise ValueError(f"The key {key} already exists.")
ValueError: The key artifacts/glue-scripts/weather_data_prepared.py already exists.
```

### What you think should happen instead

The script file on S3 should be overwritten on subsequent DAG executions, so that consecutive runs of `GlueJobOperator` tasks using local script locations do not fail. This would let us keep our script files under version control and deploy them through CI/CD pipelines.

### How to reproduce

1. Create a DAG with the `GlueJobOperator`, referencing a local script file and an S3 bucket.
2. Run the DAG twice.

### Anything else

_No response_

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
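The underlying `S3Hook.load_file` already supports overwriting via its `replace` parameter (it defaults to `False`, which is what raises the `ValueError` above); the Glue hook's script upload just does not pass it. A minimal standalone sketch of the difference, reusing the bucket, key, and connection names from the configuration above:

```
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

hook = S3Hook(aws_conn_id="datalake")

# A second call with the default replace=False raises:
# ValueError: The key artifacts/glue-scripts/weather_data_prepared.py already exists.
hook.load_file(
    "scripts/weather_data_prepared.py",
    key="artifacts/glue-scripts/weather_data_prepared.py",
    bucket_name="aws-glue-temporary-bucket",
)

# With replace=True the existing key is overwritten instead:
hook.load_file(
    "scripts/weather_data_prepared.py",
    key="artifacts/glue-scripts/weather_data_prepared.py",
    bucket_name="aws-glue-temporary-bucket",
    replace=True,
)
```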
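Until the provider is changed, a possible interim workaround (a sketch only, assuming the bucket and connection names from the example above and a hypothetical `weather_glue_job` variable holding the `GlueJobOperator` instance) is to delete the stale key in an upstream task:

```
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def _delete_stale_glue_script():
    # Remove the previously uploaded script so GlueJobOperator can re-upload it.
    hook = S3Hook(aws_conn_id="datalake")
    hook.delete_objects(
        bucket="aws-glue-temporary-bucket",
        keys=["artifacts/glue-scripts/weather_data_prepared.py"],
    )


delete_stale_glue_script = PythonOperator(
    task_id="DeleteStaleGlueScript",
    python_callable=_delete_stale_glue_script,
    dag=dag,
)

# Ensure the cleanup runs before the Glue task.
delete_stale_glue_script >> weather_glue_job
```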
