seyoon-lim opened a new issue, #40675: URL: https://github.com/apache/airflow/issues/40675
### Description

The behavior of the `spark_binary` option differs depending on whether it is defined as a parameter or in a connection.

When defined as a parameter, there is no validation, so the actual path to the `spark-submit` file can be used: `/some/path/spark-submit`, or even a binary named something other than `spark-submit`.

When defined in a connection, only `spark-submit`, `spark2-submit`, and `spark3-submit` are accepted.

```python
# https://github.com/apache/airflow/blob/22ec72606363405573774c0b2de2dcbc4ff41588/airflow/providers/apache/spark/hooks/spark_submit.py#L202
def __init__(
    ...
    spark_binary: str | None = None,
    ...
) -> None:
    ...
    self.spark_binary = spark_binary

# https://github.com/apache/airflow/blob/22ec72606363405573774c0b2de2dcbc4ff41588/airflow/providers/apache/spark/hooks/spark_submit.py#L254-L261
def _resolve_connection(self) -> dict[str, Any]:
    ...
    if not self.spark_binary:
        self.spark_binary = extra.get("spark-binary", DEFAULT_SPARK_BINARY)
        if self.spark_binary is not None and self.spark_binary not in ALLOWED_SPARK_BINARIES:
            raise RuntimeError(
                f"The spark-binary extra can be on of {ALLOWED_SPARK_BINARIES} and it"
                f" was `{self.spark_binary}`. Please make sure your spark binary is one of the"
                f" allowed ones and that it is available on the PATH"
            )
    ...

# https://github.com/apache/airflow/blob/22ec72606363405573774c0b2de2dcbc4ff41588/airflow/providers/apache/spark/hooks/spark_submit.py#L128-L134
...
"spark-binary": StringField(
    lazy_gettext("Spark binary"),
    widget=BS3TextFieldWidget(),
    description=f"Must be one of: {', '.join(ALLOWED_SPARK_BINARIES)}",
    validators=[any_of(ALLOWED_SPARK_BINARIES)],
    default=DEFAULT_SPARK_BINARY,
),
...
```

Suggestions:

1. If `spark-submit` keyword validation is necessary, improve it to validate using `endswith` for both the parameter and connection definitions (a sketch follows at the end of this issue).
2. If keyword validation is not necessary, remove the validation logic for both the parameter and connection definitions.

### Use case/motivation

In my case, I have multiple versions of Spark installed, and I want to convey this information via the connection. However, because I can't specify a path, I can't use the connection settings and have to manually pass `spark_binary` to the operator, which is inconvenient. Therefore, I think it would be beneficial to reconcile the differing validation logic between parameters and connections.

### Related issues

_No response_

### Are you willing to submit a PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
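A minimal sketch of suggestion 1, assuming the check is factored into a shared helper applied to both the `spark_binary` parameter and the connection's `spark-binary` extra. The helper name `validate_spark_binary` and the sample paths are hypothetical; `ALLOWED_SPARK_BINARIES` is copied from the provider:

```python
ALLOWED_SPARK_BINARIES = ["spark-submit", "spark2-submit", "spark3-submit"]


def validate_spark_binary(spark_binary: str) -> None:
    # endswith-based check: a bare name like `spark3-submit` passes, and so
    # does a full path like `/opt/spark/bin/spark-submit`, while an unrelated
    # binary such as `/usr/bin/echo` is still rejected.
    if not any(spark_binary.endswith(allowed) for allowed in ALLOWED_SPARK_BINARIES):
        raise RuntimeError(
            f"The spark binary `{spark_binary}` must end with one of {ALLOWED_SPARK_BINARIES}."
        )


# Illustrative inputs only: the first two could come from either the operator
# parameter or the connection extra; the last shows a value that is still rejected.
for candidate in ["spark3-submit", "/opt/spark/bin/spark-submit", "/usr/bin/echo"]:
    try:
        validate_spark_binary(candidate)
        print(f"accepted: {candidate}")
    except RuntimeError as err:
        print(f"rejected: {candidate} ({err})")
```

Matching with `endswith` keeps the existing allow-list semantics while permitting absolute paths, so a connection could point at a specific installation such as a versioned Spark under `/opt`.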
