Albertus Kelvin created AIRFLOW-6212:
----------------------------------------

             Summary: SparkSubmitHook failed to execute spark-submit to 
standalone cluster
                 Key: AIRFLOW-6212
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6212
             Project: Apache Airflow
          Issue Type: Bug
          Components: hooks, operators
    Affects Versions: 1.10.6
            Reporter: Albertus Kelvin
            Assignee: Albertus Kelvin


I was trying to submit a PySpark job with spark-submit using 
SparkSubmitOperator. I had already set the master appropriately via the 
environment variable (AIRFLOW_CONN_SPARK_DEFAULT). The value was something 
like *spark://host:port*.

However, an exception occurred: 
{noformat}
airflow.exceptions.AirflowException: Cannot execute: ['path/to/spark-submit', 
'--master', 'host:port', 'job.py']
{noformat}

It turns out that the master URL should have *spark://* preceding the 
host:port. I checked the code and found that this wasn't handled:

{code:python}
conn = self.get_connection(self._conn_id)
if conn.port:
    conn_data['master'] = "{}:{}".format(conn.host, conn.port)
else:
    conn_data['master'] = conn.host
{code}

I think the protocol should be added in both branches, like the following:

{code:python}
if conn.port:
    conn_data['master'] = "spark://{}:{}".format(conn.host, conn.port)
else:
    conn_data['master'] = "spark://{}".format(conn.host)
{code}
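To avoid doubling the scheme when the stored host already contains one, the fix could also check for an existing prefix first. A minimal sketch (the helper name {{build_master_url}} and the default scheme are my own illustration, not Airflow code):

{code:python}
def build_master_url(host, port=None, default_scheme="spark"):
    # Prepend the scheme only when the host does not already carry one,
    # so a value stored as "spark://host" is left untouched.
    master = host if "://" in host else "{}://{}".format(default_scheme, host)
    if port:
        master = "{}:{}".format(master, port)
    return master
{code}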



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
