moritzsanne opened a new issue, #38959:
URL: https://github.com/apache/airflow/issues/38959

   ### Apache Airflow Provider(s)
   
   amazon
   
   ### Versions of Apache Airflow Providers
   
   
[apache-airflow-providers-amazon](https://airflow.apache.org/docs/apache-airflow-providers-amazon/8.19.0)
    8.19.0
   
   ### Apache Airflow version
   
   2.8.3
   
   ### Operating System
   
   Amazon Linux 2; Kernel Version: 5.10.209-198.812.amzn2.x86_64
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   We deploy airflow on EKS using the official Helm chart.
   
   
   ### What happened
   
We are deploying a Glue job using the GlueJobOperator with the following configuration:
   ```
   GlueJobOperator(
       job_name="weather_data_prepared_local_file",
       script_location=str(Path(__file__).resolve().parent / "scripts/weather_data_prepared.py"),
       s3_bucket="aws-glue-temporary-bucket",
       task_id="WeatherGlueJob",
       iam_role_name="eks.data.airflow.glue.executor",
       create_job_kwargs={"GlueVersion": "4.0", "NumberOfWorkers": 2, "WorkerType": "G.1X"},
       update_config=True,
       aws_conn_id="datalake",
       dag=dag,
   )
   ```
   
This works fine for the first run of our DAG, and the script file gets uploaded to `artifacts/glue-scripts/weather_data_prepared.py`.
   However, when we trigger the DAG a second time, the task fails because the key already exists.
   
```
   [2024-04-12T08:13:33.615+0000] {glue.py:173} INFO - Initializing AWS Glue Job: weather_data_prepared
   [2024-04-12T08:13:33.659+0000] {base.py:83} INFO - Using connection ID 'datalake' for task execution
   [2024-04-12T08:13:34.265+0000] {taskinstance.py:2731} ERROR - Task failed with exception
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 44
       result = _execute_callable(context=context, **execute_callable_kwargs)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 41
       return execute_callable(context=context, **execute_callable_kwargs)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/amazon/aws/operators/g
       glue_job_run = self.glue_job_hook.initialize_job(self.script_args, self.run_job_kwargs)
     File "/usr/local/lib/python3.8/functools.py", line 967, in __get__
       val = self.func(instance)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/amazon/aws/operators/g
       s3_hook.load_file(
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/amazon/aws/hooks/s3.py
       return func(*bound_args.args, **bound_args.kwargs)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/amazon/aws/hooks/s3.py
       return func(*bound_args.args, **bound_args.kwargs)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/amazon/aws/hooks/s3.py
       raise ValueError(f"The key {key} already exists.")
   ValueError: The key artifacts/glue-scripts/weather_data_prepared.py already exists.
   ```
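
   The traceback shows the failure originating in `S3Hook.load_file`, whose `replace` parameter defaults to `False`; when the target key already exists, the hook raises this `ValueError`. A minimal stand-alone sketch of that check (a plain dict stands in for the S3 bucket; all names here are illustrative, not the provider's internals):

```python
# Stand-in for an S3 bucket: maps object keys to file contents.
bucket = {}


def load_file(contents: str, key: str, replace: bool = False) -> None:
    """Mimic the existence check in S3Hook.load_file (simplified sketch)."""
    if not replace and key in bucket:
        raise ValueError(f"The key {key} already exists.")
    bucket[key] = contents


key = "artifacts/glue-scripts/weather_data_prepared.py"
load_file("print('v1')", key)  # first DAG run: upload succeeds
try:
    load_file("print('v2')", key)  # second run: key exists, upload refused
except ValueError as err:
    print(err)  # -> The key artifacts/glue-scripts/weather_data_prepared.py already exists.
```

   With `replace=True`, the second upload would simply overwrite the key, which is the behavior we would expect for repeated DAG runs.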
   
   ### What you think should happen instead
   
We think the file on S3 should be overwritten on subsequent DAG executions, so that consecutive runs of GlueJobOperator tasks using local script locations do not fail.
   This would let us keep our script files under version control and deploy them through CI/CD pipelines.
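
   One way the desired behavior could be exposed is an opt-in flag on the operator that gets forwarded to the hook's upload as `replace=True`. The sketch below is purely illustrative of that idea; the `replace_script_file` name and the toy classes are assumptions, not the provider's actual API:

```python
# Toy in-memory "bucket" and upload helper mirroring S3-style replace semantics.
bucket = {}


def load_file(contents: str, key: str, replace: bool = False) -> None:
    # Hypothetical stand-in for S3Hook.load_file.
    if not replace and key in bucket:
        raise ValueError(f"The key {key} already exists.")
    bucket[key] = contents


class GlueJobOperatorSketch:
    """Hypothetical operator sketch: forwards an opt-in replace flag to the upload."""

    def __init__(self, script: str, key: str, replace_script_file: bool = False):
        self.script = script
        self.key = key
        self.replace_script_file = replace_script_file

    def upload_script(self) -> None:
        load_file(self.script, self.key, replace=self.replace_script_file)


job = GlueJobOperatorSketch(
    "print('v2')",
    "artifacts/glue-scripts/weather_data_prepared.py",
    replace_script_file=True,
)
job.upload_script()  # run 1: uploads the script
job.upload_script()  # run 2: overwrites instead of raising
```

   With the flag left at its default, the second `upload_script()` would still raise, preserving today's behavior for users who rely on it.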
   
   ### How to reproduce
   
    1. Create a DAG with the GlueJobOperator, referencing a local script file and an S3 bucket.
       2. Run the DAG twice.
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

