Himanshu Jain created AIRFLOW-1255:
--------------------------------------

             Summary: SparkSubmitOperator logs do not stream correctly
                 Key: AIRFLOW-1255
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1255
             Project: Apache Airflow
          Issue Type: Bug
          Components: hooks, operators
    Affects Versions: Airflow 1.8
         Environment: Spark 1.6.0 on a YARN cluster
Airflow 1.8
            Reporter: Himanshu Jain
            Priority: Minor


Logging in SparkSubmitOperator does not work as intended (logs are not streamed 
continuously as they are received from the subprocess). This is because 
spark-submit internally redirects all of its logs, including stderr, to stdout, 
which causes the current two-iterator logging to block on the empty stderr 
pipe. The logs are written only when the subprocess finishes, so 
yarn_application_id is not available until the application ends.
Specifically,

{code:title=spark_submit_hook.py (lines 217-220)|borderStyle=solid}
self._sp = subprocess.Popen(spark_submit_cmd,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            **kwargs)
{code}

needs to be changed to 
{code:title=spark_submit_hook.py|borderStyle=solid}
self._sp = subprocess.Popen(spark_submit_cmd,
                            stdout=subprocess.PIPE,
                            **kwargs)
{code}

with corresponding changes to the log-reading code in the lines that follow 
(see the sketch below).
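For illustration, here is a minimal sketch of what the single-pipe streaming 
could look like. The function name and the application-id regex are 
hypothetical stand-ins, not the actual hook code:

{code:title=Sketch (not actual hook code)|borderStyle=solid}
import re
import subprocess

def stream_spark_submit(spark_submit_cmd, **kwargs):
    # With stderr no longer captured separately, everything spark-submit
    # prints arrives on the single stdout pipe and can be streamed live.
    sp = subprocess.Popen(spark_submit_cmd,
                          stdout=subprocess.PIPE,
                          **kwargs)
    yarn_application_id = None
    for line in iter(sp.stdout.readline, b''):
        line = line.decode('utf-8').rstrip()
        print(line)  # in the hook this would go to the task log instead
        # YARN application ids look like application_<timestamp>_<counter>
        if yarn_application_id is None:
            match = re.search(r'application_\d+_\d+', line)
            if match:
                yarn_application_id = match.group(0)
    sp.wait()
    return yarn_application_id
{code}

This way the YARN application id becomes available as soon as spark-submit 
prints it, rather than only after the process exits.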

I have not tested whether the issue also exists with Spark 2.x.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
