AndyN5 opened a new issue, #45840:
URL: https://github.com/apache/airflow/issues/45840

   ### Apache Airflow version
   
   2.10.4
   
   ### If "Other Airflow 2 version" selected, which one?
   
   _No response_
   
   ### What happened?
   
   
   The pull request adds deferrable support to DataprocJobBaseOperator and 
DataprocSubmitJobOperator in Apache Airflow. This change allows asynchronous 
job submission and better handling of long-running tasks on Google Cloud 
Dataproc. The operators can defer while a job runs, which improves resource 
management and gives more flexibility in scheduling.
   
   Before this PR, the operators did not support deferrable mode, so they 
could not wait on long-running jobs in a non-blocking way. A task stayed 
blocked until its job finished, which tied up resources unnecessarily and 
scaled poorly for workflows with long-running cloud tasks, such as large data 
processing jobs on Dataproc.
   
   ### What you think should happen instead?
   
   The updated DataprocSubmitJobOperator and DataprocJobBaseOperator should 
allow asynchronous job submission, so that long-running Dataproc jobs do not 
block other tasks in the Airflow DAG. After submitting the job, the operator 
should return control to the workflow immediately while the job continues to 
run in the background. It should then defer, track the job's status, and 
resume once the job completes.
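
   To make the blocking versus deferring distinction concrete, here is a toy 
model in plain Python. It does not use Airflow and is not how Airflow 
implements deferral; `FakeJob`, `run_blocking`, `run_deferred`, and the "slot" 
log messages are all hypothetical names chosen for illustration. It only shows 
that a blocking wait finishes one task before the next starts, while deferred 
waits can overlap.

```python
import asyncio


class FakeJob:
    """Hypothetical stand-in for a remote job that reports DONE on the Nth poll."""

    def __init__(self, polls_until_done):
        self.remaining = polls_until_done

    def poll(self):
        self.remaining -= 1
        return "DONE" if self.remaining <= 0 else "RUNNING"


def run_blocking(name, job, log):
    """Blocking style: the task keeps its slot for every poll."""
    log.append(f"{name}: slot acquired")
    while job.poll() != "DONE":
        pass  # a real task would sleep here, still holding the slot
    log.append(f"{name}: slot released")


async def run_deferred(name, job, log):
    """Deferred style: submit, release the slot, then wait on the event loop."""
    log.append(f"{name}: slot acquired")
    log.append(f"{name}: slot released (deferred)")
    while job.poll() != "DONE":
        await asyncio.sleep(0)  # yield so other waits can make progress
    log.append(f"{name}: resumed")


async def run_two_deferred(log):
    # Both waits share one event loop, so they interleave.
    await asyncio.gather(
        run_deferred("a", FakeJob(3), log),
        run_deferred("b", FakeJob(3), log),
    )


blocking_log = []
run_blocking("a", FakeJob(3), blocking_log)
run_blocking("b", FakeJob(3), blocking_log)

deferred_log = []
asyncio.run(run_two_deferred(deferred_log))

print(blocking_log)
print(deferred_log)
```

   In the blocking log, task `b` cannot start until task `a` has released its 
slot. In the deferred log, `b` starts while `a` is still waiting.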
   
   
   
   ### How to reproduce
   
   1. Set up an Airflow environment with the Google Cloud provider installed, 
which supplies DataprocSubmitJobOperator and DataprocJobBaseOperator.
   
   2. Create a DAG that uses one of these operators to submit a Dataproc job 
(such as a Spark or Hadoop job) without the asynchronous capability.
   
   3. Run the DAG. Observe that the task blocks the workflow until the Dataproc 
job finishes, which is inefficient and can cause task timeouts if the job runs 
for a long time.
   
   4. Apply the changes from the PR (DataprocSubmitJobOperator with async 
capability). The task should now return control to the DAG immediately and 
defer until the job completes.
   
   5. Compare the behavior before and after the patch. Without the asynchronous 
execution mode, the task does not release control until a long job completes. 
With the patch applied, long-running tasks should be handled without blocking.
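
   The blocking and deferrable variants from the steps above can be sketched 
as one DAG. This is a minimal sketch, not the code from the PR. It assumes the 
installed Google provider's `DataprocSubmitJobOperator` accepts `deferrable` 
and `polling_interval_seconds`, which is the behavior this issue describes. 
The project, region, and cluster values are hypothetical placeholders, and the 
job definition assumes the Spark examples jar is present on the cluster.

```python
from datetime import datetime

# Hypothetical placeholders; replace with values for your environment.
PROJECT_ID = "my-project"
REGION = "us-central1"
CLUSTER_NAME = "my-cluster"


def build_spark_job(project_id, cluster_name):
    """Return a Dataproc job definition that runs the SparkPi example."""
    return {
        "reference": {"project_id": project_id},
        "placement": {"cluster_name": cluster_name},
        "spark_job": {
            # Assumes the examples jar exists at this path on the cluster.
            "jar_file_uris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
            "main_class": "org.apache.spark.examples.SparkPi",
        },
    }


def build_dag():
    # Imported here so the helper above works without Airflow installed.
    from airflow import DAG
    from airflow.providers.google.cloud.operators.dataproc import (
        DataprocSubmitJobOperator,
    )

    with DAG(
        dag_id="dataproc_blocking_vs_deferrable",
        start_date=datetime(2025, 1, 1),
        schedule=None,
        catchup=False,
    ) as dag:
        # Step 2-3: the task waits in place until the job finishes.
        DataprocSubmitJobOperator(
            task_id="submit_blocking",
            job=build_spark_job(PROJECT_ID, CLUSTER_NAME),
            region=REGION,
            project_id=PROJECT_ID,
        )
        # Step 4: the task defers while the job runs (assumed parameters).
        DataprocSubmitJobOperator(
            task_id="submit_deferrable",
            job=build_spark_job(PROJECT_ID, CLUSTER_NAME),
            region=REGION,
            project_id=PROJECT_ID,
            deferrable=True,
            polling_interval_seconds=30,
        )
    return dag


try:
    dag = build_dag()
except ImportError:
    dag = None  # Airflow or the Google provider is not installed
```

   Deferrable tasks need a running triggerer process in the Airflow 
deployment; without one, the deferred task will not resume.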
   
   ### Operating System
   
   Windows 
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   _No response_
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
