Looking forward to these additional docs :)

Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her


On Thu, Feb 26, 2026 at 12:50 PM Haiyang Sun via dev <[email protected]>
wrote:

> Hi Holden,
>
> Thanks again for the detailed comments and suggestions.
>
> I’ve responded inline in the document and will revise the SPIP to make
> several areas more explicit. For visibility, here is a short summary:
>
> 1) Security (new IPC mechanism)
>
> We will add a dedicated security section. Overall, this should not be
> worse than the current socket-based implementation. Moving to gRPC may
> actually improve our position by leveraging existing ecosystem support for
> TLS, authentication, interceptors, and observability — which are harder to
> standardize correctly on top of a raw socket protocol.
>
> 2) Performance assumptions
>
> Agreed — we should back claims with systematic benchmarking. We have an
> early gRPC prototype with preliminary results comparable to the current
> socket path, but we will avoid strong claims until properly benchmarked.
> The existing Python/Scala paths will remain, and any default switch would
> only happen after meeting explicit performance goals.
>
> 3) Fallback / migration strategy
>
> We will make this explicit in the SPIP. The plan is to separate the
> transport layer from the UDF processing logic in worker.py, so that the
> gRPC and socket transports share the same execution path. This enables
> safe fallback and reduces long-term dual-maintenance overhead.
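A minimal sketch of what such a transport/execution split could look like. All names here (Transport, run_worker, InMemoryTransport) are illustrative assumptions, not the actual worker.py internals; the point is only that the execution loop never touches the wire format.

```python
# Hypothetical split of worker.py into a transport layer and a
# transport-agnostic execution loop; names are illustrative, not Spark's.
from abc import ABC, abstractmethod

class Transport(ABC):
    """Carries serialized batches; knows nothing about UDF semantics."""
    @abstractmethod
    def recv(self): ...
    @abstractmethod
    def send(self, batch): ...

class InMemoryTransport(Transport):
    """Stand-in for a socket- or gRPC-backed implementation."""
    def __init__(self, inbound):
        self.inbound = iter(inbound)
        self.outbound = []
    def recv(self):
        # Return the next batch, or None when the stream is exhausted.
        return next(self.inbound, None)
    def send(self, batch):
        self.outbound.append(batch)

def run_worker(transport, udf):
    """Execution logic shared by all transports: apply the UDF per batch."""
    while (batch := transport.recv()) is not None:
        transport.send([udf(x) for x in batch])

t = InMemoryTransport([[1, 2], [3]])
run_worker(t, lambda x: x * 10)
print(t.outbound)  # [[10, 20], [30]]
```

Because run_worker only sees the Transport interface, swapping the socket implementation for a gRPC one (or falling back) would not change the UDF execution code.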
>
> 4) Worker specification
>
> We do have a more detailed design and can publish it as a supporting
> document. The SPIP will clarify the expected structure and required
> metadata without going too deep into implementation detail.
>
> 5) Dependency management
>
> This will be defined in the worker specification. Each language
> implementation defines its dependency requirements, and clusters are
> expected to provision environments accordingly (as is already the case for
> Python today).
>
> 6) Unified query planning concerns
>
> The intent is not to force identical planning behavior across languages.
> The worker specification can expose metadata (e.g., pipelining support,
> concurrency, memory characteristics, data format constraints), allowing the
> planner to remain flexible and language-aware without hardcoding
> per-language rules.
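To make the metadata idea concrete, here is a hedged sketch of the kind of per-language capability record the planner might consult. The field names (supports_pipelining, max_concurrency, data_formats) are assumptions for illustration, not part of the SPIP or the worker specification.

```python
# Illustrative shape of per-language worker metadata; field names are
# assumptions, not the actual specification.
from dataclasses import dataclass

@dataclass(frozen=True)
class WorkerCapabilities:
    language: str
    supports_pipelining: bool  # can chain UDFs inside one worker process
    max_concurrency: int       # parallel invocations the worker accepts
    data_formats: tuple        # e.g. ("arrow",)

def can_fuse(caps: WorkerCapabilities) -> bool:
    """Planner-side check: fuse adjacent UDFs only if the worker allows it."""
    return caps.supports_pipelining

py = WorkerCapabilities("python", True, 4, ("arrow",))
other = WorkerCapabilities("rust", False, 8, ("arrow",))
print(can_fuse(py), can_fuse(other))  # True False
```

The planner stays language-aware by reading these declared capabilities rather than hardcoding per-language rules.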
>
> 7) Inter-UDF pipelining
>
> Pipelining is supported by the protocol design (similar to PySpark). The
> init message can declare multiple UDFs and define chaining and input
> mappings. Whether a language supports this can be expressed in worker
> metadata so planning can respect it.
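As a rough illustration of the chaining idea, an init message could declare multiple UDFs whose input mappings reference either source columns or other UDFs' outputs. The dict layout below is a hypothetical sketch, not the actual protocol schema.

```python
# Hypothetical init message declaring two chained UDFs; the schema is
# illustrative of the chaining idea, not the real protocol definition.
init_message = {
    "udfs": [
        {"id": "u1", "inputs": ["col_a"]},  # reads a source column
        {"id": "u2", "inputs": ["u1"]},     # consumes u1's output
    ],
}

def chained_ids(msg):
    """Return ids of UDFs whose inputs include another UDF's output."""
    ids = {u["id"] for u in msg["udfs"]}
    return [u["id"] for u in msg["udfs"] if ids & set(u["inputs"])]

print(chained_ids(init_message))  # ['u2']
```

A worker that does not declare pipelining support in its metadata would simply never receive an init message with such a chain.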
>
> Hopefully this addresses the main concerns. I’ll update the SPIP to
> reflect these clarifications more explicitly.
>
> Thanks again for the thoughtful review.
>
>
>