cadet354 opened a new issue #15280:
URL: https://github.com/apache/airflow/issues/15280


   **Apache Airflow version**: v2.0.1
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**:
   - bare metal
   - **OS** (e.g. from /etc/os-release):
   - Ubuntu 20.04.2 LTS (GNU/Linux 5.4.0-65-generic x86_64)
   - **Kernel** (e.g. `uname -a`):
   - Linux 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 
x86_64 x86_64 GNU/Linux
   - **Install tools**:
   - miniconda
   - **Others**:
   - python 3.7
   - hadoop 2.9.2
   - spark 2.4.7
   
   **What happened**:
   I have two problems:
   1. When a DAG starts spark-submit on YARN with deploy_mode="cluster", Airflow does not track the driver's state.
   Therefore, when the job fails on YARN, the DAG's state remains "running".
   
   2. When I manually stop a DAG's job, for example by marking it as "failed", the same job keeps running on the YARN cluster.
   This happens because an empty environment is passed to subprocess.Popen, so the Hadoop bin directory is missing from PATH and the `yarn` binary cannot be found:
   
   ```
   ERROR - [Errno 2] No such file or directory: 'yarn': 'yarn'
   ```
   
   **What you expected to happen**:
   In the first case, the task should move to the "failed" state.
   In the second case, the task should be stopped on the YARN cluster.
   
   **How to reproduce it**:
   To reproduce the first issue: start spark-submit on the YARN cluster with deploy_mode="cluster" and master="yarn", then kill the job from the YARN UI. In Airflow, the task state remains "running".
   To reproduce the second issue: start spark-submit the same way, then manually mark the running task as "failed" in Airflow. On the Hadoop cluster, the same job remains in the running state.
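
   For reference, the repro can be driven by a spark-submit invocation along these lines (the application path and name are placeholders, not taken from the report):

```shell
# Placeholder spark-submit command matching the repro steps;
# /path/to/app.py and the --name value are illustrative.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --name airflow-yarn-repro \
  /path/to/app.py
```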
   
   **Anything else we need to know**:
   I propose the following changes to airflow/providers/apache/spark/hooks/spark_submit.py (`SparkSubmitHook`):
   1. Line 201:
   ```python
   return 'spark://' in self._connection['master'] and self._connection['deploy_mode'] == 'cluster'
   ```
   change to
   ```python
   return ('spark://' in self._connection['master'] or self._connection['master'] == 'yarn') \
       and self._connection['deploy_mode'] == 'cluster'
   ```
   2. Line 659:
   ```python
   env = None
   ```
   change to
   ```python
   env = {**os.environ.copy(), **(self._env if self._env else {})}
   ```
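
   To illustrate the first change in isolation, here is a standalone sketch of the proposed check as a plain function (names simplified from the hook's method; the connection dicts below are examples, not real connections):

```python
# Standalone sketch of the proposed driver-status check.
# The original check only matched 'spark://' masters, so YARN cluster-mode
# jobs were never tracked; the fix also accepts master == 'yarn'.
def should_track_driver_status(connection):
    """Return True when the driver runs remotely (cluster deploy mode)."""
    return ('spark://' in connection['master'] or connection['master'] == 'yarn') \
        and connection['deploy_mode'] == 'cluster'

yarn_cluster = {'master': 'yarn', 'deploy_mode': 'cluster'}
spark_cluster = {'master': 'spark://host:7077', 'deploy_mode': 'cluster'}
yarn_client = {'master': 'yarn', 'deploy_mode': 'client'}

print(should_track_driver_status(yarn_cluster))   # True (previously False)
print(should_track_driver_status(spark_cluster))  # True
print(should_track_driver_status(yarn_client))    # False
```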
   Applying this patch solved the above issues.
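
   The second change can be sketched in isolation as follows (`extra_env` stands in for the hook's `self._env`; the variable names and values are illustrative):

```python
import os
import subprocess

# Merge the current process environment with any extra variables instead of
# passing env=None / an incomplete dict to subprocess.Popen.
extra_env = {'SPARK_CONF_DIR': '/tmp/spark-conf'}  # stand-in for self._env
env = {**os.environ.copy(), **(extra_env if extra_env else {})}

# PATH from the parent process survives, so binaries like `yarn` resolve:
assert env['PATH'] == os.environ['PATH']
assert env['SPARK_CONF_DIR'] == '/tmp/spark-conf'

# Popen now inherits the full PATH when spawning the kill command:
proc = subprocess.Popen(['echo', 'ok'], env=env, stdout=subprocess.PIPE)
out, _ = proc.communicate()
print(out.decode().strip())
```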


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

