albertusk95 opened a new pull request #7075: [AIRFLOW-6212] SparkSubmitHook 
resolve connection
URL: https://github.com/apache/airflow/pull/7075
 
 
   **Problem**
   
   I first tried to use `SparkSubmitOperator` against a standalone cluster. Unfortunately, the `spark-submit` task failed with the following exception:
   ```
   airflow.exceptions.AirflowException: Cannot execute: [path/to/spark-submit, '--master', host:port, job_file.py]
   ```
   
   My first question was why the master address lacked the `spark://` prefix; it should be `--master spark://host:port`. A quick check of the source code showed that the scheme is never added. Please take a look at the following code snippet [source](https://github.com/apache/airflow/blob/master/airflow/contrib/hooks/spark_submit_hook.py#L171).
   
   After reviewing the subsequent method calls, it turned out that the driver status tracking feature is never used at all because of the above bug. Look at the following code snippet.
   
   ```python
    def _resolve_should_track_driver_status(self):
        """
        Determines whether or not this hook should poll the spark driver
        status through subsequent spark-submit status requests after the
        initial spark-submit request
        :return: if the driver status should be tracked
        """
        return ('spark://' in self._connection['master'] and
                self._connection['deploy_mode'] == 'cluster')
   ```
   
   The above method always returns `False` because the master address doesn't start with a scheme such as `spark://`.
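   The failure mode is easy to reproduce in isolation. This is a hedged sketch; `should_track` is a hypothetical stand-in for the method above, not Airflow code:

   ```python
   # Minimal illustration: the driver-status check requires the literal
   # "spark://" prefix in the master string, so a scheme-less host fails
   # the check even in cluster deploy mode.
   def should_track(master, deploy_mode):
       return 'spark://' in master and deploy_mode == 'cluster'

   print(should_track('host:7077', 'cluster'))          # False: scheme was stripped
   print(should_track('spark://host:7077', 'cluster'))  # True
   ```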
   
   Later on, I investigated the `Connection` model (_airflow.models.connection_) further and found that if we provide a URI (e.g. _spark://host:port_), the attributes of the `Connection` object are derived by parsing that URI.
   
   When the host is parsed, the resulting value is only the hostname, without the scheme. That makes this a critical bug.
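   The behaviour can be seen with the standard library's `urlparse`, which is, as far as I can tell, the kind of URI parsing the `Connection` model relies on:

   ```python
   from urllib.parse import urlparse

   # Parsing a spark:// URI: the scheme and hostname land in separate
   # fields, so a host read from `parts.hostname` has no "spark://" prefix.
   parts = urlparse('spark://host:7077')
   print(parts.scheme)    # 'spark'
   print(parts.hostname)  # 'host'
   print(parts.port)      # 7077
   ```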
   
   **Proposed Solution**
   
   I don't think we need the whole URI. When storing the connection data as an environment variable, we could instead specify the URI parts as JSON. This approach sidesteps the URI parsing problem entirely.
   
   The `conn_id` is still preserved.
   
   Take a look at the following example (`conn_id` = "spark_default"). For simplicity, assume that `extra` is in JSON form.
   
   ```
   AIRFLOW_CONN_SPARK_DEFAULT='{"conn_type": <conn_type>, "host":<host>, 
"port":<port>, "schema":<schema>, "extra":<extra>}'
   ```
   
   Although this solution avoids the incorrect results produced by URI parsing, one needs to ensure that each attribute (host, port, schema, etc.) stores the relevant value. Still, that is much easier than writing a correct URI parser. Moreover, this technique gives the connection data builder the same pattern in both database and environment variable modes: each uses a structured data specification.
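   A rough sketch of how the JSON form could be produced and consumed. Field names follow the example above; this is illustrative code, not existing Airflow behaviour:

   ```python
   import json
   import os

   # Store the connection parts as structured JSON instead of a URI.
   os.environ['AIRFLOW_CONN_SPARK_DEFAULT'] = json.dumps({
       'conn_type': 'spark',
       'host': 'spark://host',  # scheme survives intact; no URI parsing involved
       'port': 7077,
       'schema': '',
       'extra': '{}',
   })

   # Reading the parts back is a plain json.loads, so nothing is stripped.
   conn = json.loads(os.environ['AIRFLOW_CONN_SPARK_DEFAULT'])
   master = '{}:{}'.format(conn['host'], conn['port'])
   print(master)  # spark://host:7077
   ```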
   
   ---
   Link to JIRA issue: https://issues.apache.org/jira/browse/AIRFLOW-6212
   
   - [X] Description above provides context of the change
   - [X] Commit message starts with `[AIRFLOW-NNNN]`, where AIRFLOW-NNNN = JIRA 
ID*
   - [ ] Unit tests coverage for changes (not needed for documentation changes)
   - [X] Commits follow "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)"
   - [ ] Relevant documentation is updated including usage instructions.
   - [ ] I will engage committers as explained in [Contribution Workflow 
Example](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#contribution-workflow-example).
   
   (*) For document-only changes, no JIRA issue is needed. Commit message 
starts `[AIRFLOW-XXXX]`.
   
   ---
   In case of fundamental code change, Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals))
 is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party 
License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in 
[UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   Read the [Pull Request 
Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines)
 for more information.
   
