lisancao opened a new pull request, #66498: URL: https://github.com/apache/airflow/pull/66498
The Spark 4.x `spark-pipelines` CLI is Connect-native: it uses `SparkSession.builder.getOrCreate()` and rejects `--master` / `--deploy-mode` with `SparkException: Remote cannot be specified with master and/or deploy mode`. However, `SparkPipelinesHook` inherits from `SparkSubmitHook` and unconditionally emits the spark-submit cluster-manager flags. The bash launcher itself also starts a JVM `SparkContext` that binds an in-process Connect server, which collides with a long-running daemon when `SPARK_REMOTE` is set. Together, these make `SparkPipelinesOperator` unusable against any `spark_connect`-typed connection, even though `SparkConnectHook` ships in the same provider.

The fix overrides `SparkPipelinesHook._resolve_connection()`: when the configured `conn_id` resolves to a `spark_connect` connection, it populates `spark_remote` from `SparkConnectHook.get_connection_url()` and zeroes out the cluster-manager fields. Downstream methods then branch on `self._connection["spark_remote"]`:

* `_get_spark_binary_path()` returns `[sys.executable, "-m", "pyspark.pipelines.cli"]` for Connect connections (bypassing the bash launcher's JVM `SparkContext`) and `["spark-pipelines"]` for everything else.
* `_build_spark_pipelines_command()` skips `_build_spark_common_args()` for Connect (no `--master` / `--deploy-mode` / `--name`).
* `submit_pipeline()` sets `SPARK_REMOTE` in the subprocess environment, preserving any caller-supplied override via `env_vars` for failover.

Spark-submit-style behaviour (`spark` / `yarn` / `k8s` connection types) is unchanged; connection-lookup failures fall through to the parent's resolver. All existing tests pass without modification.

Adds three tests for the `spark_connect` path covering (a) command shape without cluster-manager flags, (b) `SPARK_REMOTE` injection from the connection URI, and (c) preservation of a caller-supplied `SPARK_REMOTE`.

---

##### Was generative AI tooling used to co-author this PR?
- [X] Yes: Claude Code (Opus 4.7, 1M context)

Generated-by: Claude Code (Opus 4.7) following [the guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions)
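
For readers skimming the description, the branching on `spark_remote` can be sketched roughly as follows. This is a standalone illustration, not the code in the PR: the function names `binary_path`, `build_command`, and `build_env`, the plain-dict connection shape with `master` / `deploy_mode` keys, the `run` subcommand, the `sc://` URLs, and the argument ordering are all assumptions made for the example. Only the two launcher commands, the skipped flags, and the `SPARK_REMOTE` handling come from the description.

```python
import sys
from typing import Optional


def binary_path(connection: dict) -> list[str]:
    """Pick the launcher: the Python module for Connect, the bash launcher otherwise."""
    if connection.get("spark_remote"):
        # Bypasses the bash launcher (and its JVM SparkContext) for Connect.
        return [sys.executable, "-m", "pyspark.pipelines.cli"]
    return ["spark-pipelines"]


def build_command(connection: dict, pipeline_args: list[str]) -> list[str]:
    """Assemble the command, adding cluster-manager flags only for non-Connect connections."""
    cmd = binary_path(connection)
    if not connection.get("spark_remote"):
        # Hypothetical stand-in for the spark-submit-style common args.
        cmd += ["--master", connection["master"]]
        if connection.get("deploy_mode"):
            cmd += ["--deploy-mode", connection["deploy_mode"]]
    return cmd + pipeline_args


def build_env(connection: dict, env_vars: Optional[dict] = None) -> dict:
    """Inject SPARK_REMOTE for Connect, keeping any caller-supplied value."""
    env = dict(env_vars or {})
    if connection.get("spark_remote"):
        # setdefault leaves an explicit caller override (e.g. a failover URL) untouched.
        env.setdefault("SPARK_REMOTE", connection["spark_remote"])
    return env


if __name__ == "__main__":
    connect = {"spark_remote": "sc://example-host:15002", "master": None, "deploy_mode": None}
    print(build_command(connect, ["run"]))
    print(build_env(connect))
```

Using `setdefault` mirrors the stated intent that a `SPARK_REMOTE` passed through `env_vars` wins over the URL derived from the connection.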
