Hi Holden,

Thanks again for the detailed comments and suggestions.

I’ve responded inline in the document and will revise the SPIP to make
several areas more explicit. For visibility, here is a short summary:

1) Security (new IPC mechanism)

We will add a dedicated security section. Overall, this should be no worse
than the current socket-based implementation. Moving to gRPC may actually
improve our security posture by leveraging existing ecosystem support for
TLS, authentication, interceptors, and observability — all of which are
harder to standardize correctly on top of a raw socket protocol.

2) Performance assumptions

Agreed — we should back claims with systematic benchmarking. We have an
early gRPC prototype with preliminary results comparable to the current
socket path, but we will avoid strong claims until properly benchmarked.
The existing Python/Scala paths will remain, and any default switch would
only happen after meeting explicit performance goals.

3) Fallback / migration strategy

We will make this explicit in the SPIP. The plan is to separate the
transport layer from UDF processing logic in worker.py, allowing gRPC and
socket to share the same execution logic. This enables safe fallback and
reduces long-term dual-maintenance overhead.
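To make that separation concrete, here is a rough sketch of the idea (class
and function names are illustrative only, not the actual worker.py
structure): the execution loop is written once against a transport
interface, and gRPC vs. socket only swap the transport implementation.

```python
# Illustrative sketch only; names are hypothetical, not the real worker.py API.
from abc import ABC, abstractmethod


class Transport(ABC):
    """Abstracts how serialized batches reach the worker."""

    @abstractmethod
    def recv(self):
        """Return the next input batch, or None when the stream ends."""

    @abstractmethod
    def send(self, batch):
        """Return a result batch to the caller."""


def run_worker(transport: Transport, udf):
    # Shared execution loop: identical regardless of transport, so a
    # fallback from gRPC to socket only replaces the Transport object.
    while (batch := transport.recv()) is not None:
        transport.send([udf(x) for x in batch])


class InMemoryTransport(Transport):
    """Toy transport used here just to exercise the shared loop."""

    def __init__(self, batches):
        self._in = iter(batches)
        self.out = []

    def recv(self):
        return next(self._in, None)

    def send(self, batch):
        self.out.append(batch)
```

For example, `run_worker(InMemoryTransport([[1, 2], [3]]), lambda x: x + 1)`
drives the same loop a socket- or gRPC-backed transport would.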

4) Worker specification

We do have a more detailed design and can publish it as a supporting
document. The SPIP will clarify the expected structure and required
metadata without going too deep into implementation detail.

5) Dependency management

This will be defined in the worker specification. Each language
implementation defines its dependency requirements, and clusters are
expected to provision environments accordingly (as is already the case for
Python today).

6) Unified query planning concerns

The intent is not to force identical planning behavior across languages.
The worker specification can expose metadata (e.g., pipelining support,
concurrency, memory characteristics, data format constraints), allowing the
planner to remain flexible and language-aware without hardcoding
per-language rules.
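As a rough illustration of the shape this metadata could take (field names
here are hypothetical, not part of the specification), the planner would
consult declared capabilities rather than special-casing languages:

```python
# Hypothetical worker metadata; field names are illustrative only.
from dataclasses import dataclass


@dataclass
class WorkerMetadata:
    language: str
    supports_pipelining: bool = False
    max_concurrency: int = 1
    data_formats: tuple = ("arrow",)


def can_fuse_udfs(meta: WorkerMetadata) -> bool:
    # The planner checks a declared capability instead of
    # hardcoding a per-language rule.
    return meta.supports_pipelining


py_worker = WorkerMetadata(language="python", supports_pipelining=True)
go_worker = WorkerMetadata(language="go")
```

A new language implementation then only has to declare its capabilities
accurately; the planner never needs to know the language by name.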

7) Inter-UDF pipelining

Pipelining is supported by the protocol design (similar to PySpark). The
init message can declare multiple UDFs and define chaining and input
mappings. Whether a language supports this can be expressed in worker
metadata so planning can respect it.
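A minimal sketch of what such an init message might declare (the structure
and key names below are illustrative, not the actual protocol): multiple
UDFs, where one UDF's input mapping references another UDF's output to
express chaining.

```python
# Illustrative init-message payload; keys are hypothetical.
init_message = {
    "udfs": [
        {"id": "a", "inputs": ["col0"]},  # reads an input column
        {"id": "b", "inputs": ["a"]},     # consumes a's output: chained
    ],
}


def chained_udf_ids(msg):
    """Return ids of UDFs whose input is another declared UDF's output."""
    declared = {u["id"] for u in msg["udfs"]}
    return [u["id"] for u in msg["udfs"]
            if any(i in declared for i in u["inputs"])]
```

A worker that does not advertise pipelining support would simply never
receive an init message with such cross-UDF input mappings.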

Hopefully this addresses the main concerns. I’ll update the SPIP to reflect
these clarifications more explicitly.

Thanks again for the thoughtful review.
