Himanshu Jain created AIRFLOW-1255:
--------------------------------------
Summary: SparkSubmitOperator logs do not stream correctly
Key: AIRFLOW-1255
URL: https://issues.apache.org/jira/browse/AIRFLOW-1255
Project: Apache Airflow
Issue Type: Bug
Components: hooks, operators
Affects Versions: Airflow 1.8
Environment: Spark 1.6.0 on a YARN cluster
Airflow 1.8
Reporter: Himanshu Jain
Priority: Minor
Logging in SparkSubmitOperator does not work as intended: output should be streamed
continuously as it is received from the subprocess. The cause is that spark-submit
internally redirects all of its logs to stdout (including stderr), which leaves the
current two-iterator logging stuck on the empty stderr pipe. As a result, the logs
are written only when the subprocess finishes, and the yarn_application_id is not
available until the end of the application.
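For illustration, here is a minimal, self-contained sketch of the pattern described
above (a stand-in child process is used instead of spark-submit; this is not the
hook's exact code):
{code:title=blocking pattern (illustrative)|borderStyle=solid}
import subprocess
import sys

# Stand-in child that, like spark-submit here, writes all logs to stdout.
sp = subprocess.Popen(
    [sys.executable, "-c", "print('log line 1'); print('log line 2')"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)

# Reading the (empty) stderr pipe yields nothing until the child exits;
# with enough unread stdout, the pipe buffer fills and both sides stall,
# so the logs only appear after the subprocess finishes.
for err_line in iter(sp.stderr.readline, b""):
    print(err_line.decode("utf-8").rstrip())

# Only now, after the child has exited, is stdout drained.
for out_line in iter(sp.stdout.readline, b""):
    print(out_line.decode("utf-8").rstrip())
{code}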
Specifically,
{code:title=spark_submit_hook.py (lines 217-220)|borderStyle=solid}
self._sp = subprocess.Popen(spark_submit_cmd,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            **kwargs)
{code}
needs to be changed to
{code:title=spark_submit_hook.py|borderStyle=solid}
self._sp = subprocess.Popen(spark_submit_cmd,
                            stdout=subprocess.PIPE,
                            **kwargs)
{code}
with corresponding changes to the log-processing lines that follow (a sketch of
what that loop could look like is below).
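As a sketch of the adjusted loop, reading a single stdout pipe line by line: the
function name, the regex, and the stderr=subprocess.STDOUT merge are illustrative
assumptions, not the hook's actual code.
{code:title=streaming-loop sketch (illustrative)|borderStyle=solid}
import re
import subprocess

def spark_submit_and_stream_logs(spark_submit_cmd, **kwargs):
    """Illustrative sketch only; names and regex are assumptions."""
    sp = subprocess.Popen(spark_submit_cmd,
                          stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT,  # assumption: fold stderr into stdout
                          **kwargs)
    yarn_application_id = None
    # Stream each line as it is produced instead of after the process exits.
    for raw in iter(sp.stdout.readline, b""):
        line = raw.decode("utf-8").rstrip()
        print(line)
        # The YARN application id can now be picked up mid-run.
        match = re.search(r"(application[0-9_]+)", line)
        if match and yarn_application_id is None:
            yarn_application_id = match.group(1)
    sp.wait()
    return yarn_application_id, sp.returncode
{code}
Merging stderr into stdout (rather than dropping the stderr pipe entirely) keeps any
stray stderr output in the same stream; either variant avoids blocking on a second
iterator.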
I have not tested whether the issue also exists with Spark 2.x.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)