Albertus Kelvin created AIRFLOW-6212:
----------------------------------------
Summary: SparkSubmitHook failed to execute spark-submit to
standalone cluster
Key: AIRFLOW-6212
URL: https://issues.apache.org/jira/browse/AIRFLOW-6212
Project: Apache Airflow
Issue Type: Bug
Components: hooks, operators
Affects Versions: 1.10.6
Reporter: Albertus Kelvin
Assignee: Albertus Kelvin
I was trying to submit a PySpark job with spark-submit using
SparkSubmitOperator. I had already set up the master appropriately via an
environment variable (AIRFLOW_CONN_SPARK_DEFAULT). The value was something
like *spark://host:port*.
However, an exception occurred:
{noformat}
airflow.exceptions.AirflowException: Cannot execute: ['path/to/spark-submit',
'--master', 'host:port', 'job.py']
{noformat}
It turns out that the master passed to spark-submit must be prefixed with
*spark://*. I checked the code and found that this isn't handled: the master
is rebuilt from the connection's host and port, so the scheme is lost.
{code:python}
# The master is rebuilt from host and port only, dropping the scheme
conn = self.get_connection(self._conn_id)
if conn.port:
    conn_data['master'] = "{}:{}".format(conn.host, conn.port)
else:
    conn_data['master'] = conn.host
{code}
I think the scheme should be prepended, for example:
{code:python}
conn_data['master'] = "spark://{}:{}".format(conn.host, conn.port)
{code}
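A more defensive variant might prepend the scheme only when the host doesn't
already carry one, so masters such as *yarn* or *local[\*]* pass through
unchanged. This is just a sketch; the helper name {{build_master}} and the
default scheme are my own, not part of the hook:

{code:python}
def build_master(host, port=None, default_scheme="spark"):
    """Build the --master value for spark-submit.

    Prepends the default scheme only when the host has no scheme,
    so values like 'yarn', 'local[2]' or 'mesos://host' are untouched.
    """
    master = "{}:{}".format(host, port) if port else host
    needs_scheme = (
        "://" not in host
        and host not in ("yarn", "local")
        and not host.startswith("local[")
    )
    if needs_scheme:
        master = "{}://{}".format(default_scheme, master)
    return master
{code}
With this, {{build_master("host", 7077)}} yields {{spark://host:7077}}, while
{{build_master("yarn")}} stays {{yarn}}.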
--
This message was sent by Atlassian Jira
(v8.3.4#803005)