seyoon-lim opened a new issue, #40675: URL: https://github.com/apache/airflow/issues/40675
### Description

The behavior of the `spark_binary` option differs depending on whether it is defined as a parameter or in a connection.

When defined as a parameter, there is no validation, so the actual path to the `spark-submit` file can be used: `/some/path/spark-submit`, or even a binary named something other than `spark-submit`.

When defined in a connection, only `spark-submit`, `spark2-submit`, and `spark3-submit` are accepted.

```python
# https://github.com/apache/airflow/blob/22ec72606363405573774c0b2de2dcbc4ff41588/airflow/providers/apache/spark/hooks/spark_submit.py#L202
def __init__(
    ...
    spark_binary: str | None = None,
    ...
) -> None:
    ...
    self.spark_binary = spark_binary

# https://github.com/apache/airflow/blob/22ec72606363405573774c0b2de2dcbc4ff41588/airflow/providers/apache/spark/hooks/spark_submit.py#L254-L261
def _resolve_connection(self) -> dict[str, Any]:
    ...
    if not self.spark_binary:
        self.spark_binary = extra.get("spark-binary", DEFAULT_SPARK_BINARY)
        if self.spark_binary is not None and self.spark_binary not in ALLOWED_SPARK_BINARIES:
            raise RuntimeError(
                f"The spark-binary extra can be on of {ALLOWED_SPARK_BINARIES} and it"
                f" was `{self.spark_binary}`. Please make sure your spark binary is one of the"
                f" allowed ones and that it is available on the PATH"
            )
    ...

# https://github.com/apache/airflow/blob/22ec72606363405573774c0b2de2dcbc4ff41588/airflow/providers/apache/spark/hooks/spark_submit.py#L128-L134
...
"spark-binary": StringField(
    lazy_gettext("Spark binary"),
    widget=BS3TextFieldWidget(),
    description=f"Must be one of: {', '.join(ALLOWED_SPARK_BINARIES)}",
    validators=[any_of(ALLOWED_SPARK_BINARIES)],
    default=DEFAULT_SPARK_BINARY,
),
...
```

Suggestions:

1. If `spark-submit` keyword validation is necessary, improve it to validate using `endswith` for both the parameter and connection definitions (a sketch follows at the end of this issue).
2. If keyword validation is not necessary, remove the validation logic for both the parameter and connection definitions.

### Use case/motivation

In my case, I have multiple versions of Spark installed, and I want to convey this information via the connection. However, because I can't specify a path, I can't use the connection settings and have to manually pass `spark_binary` to the operator, which is inconvenient. Therefore, I think it would be beneficial to reconcile the differing validation logic between parameters and connections.

### Related issues

_No response_

### Are you willing to submit a PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
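A minimal sketch of suggestion 1, assuming the check is factored into a shared helper applied to both the `spark_binary` parameter and the connection's `spark-binary` extra. The helper name `validate_spark_binary` and the sample paths are hypothetical; `ALLOWED_SPARK_BINARIES` is copied from the provider:

```python
ALLOWED_SPARK_BINARIES = ["spark-submit", "spark2-submit", "spark3-submit"]


def validate_spark_binary(spark_binary: str) -> None:
    # endswith-based check: a bare name like `spark3-submit` passes, and so
    # does a full path like `/opt/spark/bin/spark-submit`, while an unrelated
    # binary such as `/usr/bin/echo` is still rejected.
    if not any(spark_binary.endswith(allowed) for allowed in ALLOWED_SPARK_BINARIES):
        raise RuntimeError(
            f"The spark binary `{spark_binary}` must end with one of {ALLOWED_SPARK_BINARIES}."
        )


# Illustrative inputs only: the first two could come from either the operator
# parameter or the connection extra; the last shows a value that is still rejected.
for candidate in ["spark3-submit", "/opt/spark/bin/spark-submit", "/usr/bin/echo"]:
    try:
        validate_spark_binary(candidate)
        print(f"accepted: {candidate}")
    except RuntimeError as err:
        print(f"rejected: {candidate} ({err})")
```

Matching with `endswith` keeps the existing allow-list semantics while permitting absolute paths, so a connection could point at a specific installation such as a versioned Spark under `/opt`.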
