lisancao opened a new pull request, #66498:
URL: https://github.com/apache/airflow/pull/66498

   The Spark 4.x `spark-pipelines` CLI is Connect-native: it uses
`SparkSession.builder.getOrCreate()` and rejects `--master` / `--deploy-mode`
with `SparkException: Remote cannot be specified with master and/or deploy mode`.
However, `SparkPipelinesHook` inherits from `SparkSubmitHook` and
unconditionally emits the spark-submit cluster-manager flags. The bash launcher
itself also starts a JVM `SparkContext` that binds an in-process Connect
server, colliding with a long-running daemon when `SPARK_REMOTE` is set.
   
   This makes `SparkPipelinesOperator` unusable against any
`spark_connect`-typed connection, even though `SparkConnectHook` ships in the
same provider.
   
   The fix overrides `SparkPipelinesHook._resolve_connection()`: when the 
configured `conn_id` resolves to a `spark_connect` connection, populate 
`spark_remote` from `SparkConnectHook.get_connection_url()` and zero out the 
cluster-manager fields. Downstream methods then branch on 
`self._connection["spark_remote"]`:
   
   * `_get_spark_binary_path()` returns `[sys.executable, "-m", 
"pyspark.pipelines.cli"]` for Connect connections (bypasses the bash launcher's 
JVM SparkContext) and `["spark-pipelines"]` for everything else.
   * `_build_spark_pipelines_command()` skips `_build_spark_common_args()` for
Connect (no `--master` / `--deploy-mode` / `--name`).
   * `submit_pipeline()` sets `SPARK_REMOTE` in the subprocess environment, 
preserving any caller-supplied override via `env_vars` for failover.
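
The branching in the list above can be sketched roughly as follows. This is a standalone illustration under stated assumptions, not the provider's code: `binary_path`, `build_command`, and `build_env` are hypothetical helper names, the connection is modelled as a plain dict, and the common and pipeline arguments are passed in by the caller rather than built here.

```python
import sys


def binary_path(connection: dict) -> list:
    """Pick the launcher: the Python module for Connect, the bash launcher otherwise."""
    if connection.get("spark_remote"):
        # Bypasses the bash launcher and its JVM SparkContext.
        return [sys.executable, "-m", "pyspark.pipelines.cli"]
    return ["spark-pipelines"]


def build_command(connection: dict, common_args: list, pipeline_args: list) -> list:
    """Assemble the command; cluster-manager flags are skipped for Connect."""
    cmd = binary_path(connection)
    if not connection.get("spark_remote"):
        # Only spark-submit-style connections get --master / --deploy-mode / --name.
        cmd = cmd + list(common_args)
    return cmd + list(pipeline_args)


def build_env(connection: dict, env_vars=None) -> dict:
    """Inject SPARK_REMOTE unless the caller already supplied one."""
    env = dict(env_vars or {})
    remote = connection.get("spark_remote")
    if remote:
        # setdefault keeps a caller-supplied SPARK_REMOTE intact.
        env.setdefault("SPARK_REMOTE", remote)
    return env
```

Using `setdefault` is one simple way to express "connection value unless the caller overrides it"; the actual hook may merge the environment differently.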
   
   Spark-submit-style behaviour (`spark` / `yarn` / `k8s` connection types) is
unchanged, and connection-lookup failures fall through to the parent's resolver.
All existing tests pass without modification.
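
A minimal sketch of the resolver override and its fall-through, assuming a simplified shape. The names here are hypothetical: `lookup` stands in for the connection lookup, `parent_resolver` for the inherited resolver, `LookupError` for whatever exception a failed lookup raises, and the `master` / `deploy_mode` / `queue` keys and the `url` field are assumed for illustration only.

```python
from typing import Any, Callable, Dict

# Assumed names for the cluster-manager fields that get zeroed out.
CLUSTER_MANAGER_KEYS = ("master", "deploy_mode", "queue")


def resolve_connection(
    conn_id: str,
    lookup: Callable[[str], Dict[str, Any]],
    parent_resolver: Callable[[], Dict[str, Any]],
) -> Dict[str, Any]:
    """Return Connect settings for spark_connect connections, else defer to the parent."""
    try:
        conn = lookup(conn_id)
    except LookupError:
        conn = None  # lookup failed: fall through to the parent's resolver

    if conn is None or conn.get("conn_type") != "spark_connect":
        # Spark-submit-style path is unchanged; spark_remote stays unset.
        return {**parent_resolver(), "spark_remote": None}

    # Connect path: populate spark_remote and zero out cluster-manager fields.
    resolved: Dict[str, Any] = {key: None for key in CLUSTER_MANAGER_KEYS}
    resolved["spark_remote"] = conn["url"]
    return resolved
```

Downstream code can then branch on whether `spark_remote` is set, which keeps the Connect-specific logic out of the spark-submit path.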
   
   Adds three tests for the spark_connect path covering (a) command shape 
without cluster-manager flags, (b) `SPARK_REMOTE` injection from the connection 
URI, and (c) caller-supplied `SPARK_REMOTE` preservation.
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   - [X] Yes — Claude Code (Opus 4.7, 1M context)
   
   Generated-by: Claude Code (Opus 4.7) following [the 
guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions)

