Looking forward to these additional docs :)

Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her
On Thu, Feb 26, 2026 at 12:50 PM Haiyang Sun via dev <[email protected]> wrote:

> Hi Holden,
>
> Thanks again for the detailed comments and suggestions.
>
> I’ve responded inline in the document and will revise the SPIP to make several areas more explicit. For visibility, here is a short summary:
>
> 1) Security (new IPC mechanism)
>
> We will add a dedicated security section. Overall, this should not be worse than the current socket-based implementation. Moving to gRPC may actually improve our position by leveraging existing ecosystem support for TLS, authentication, interceptors, and observability, which are harder to standardize correctly on top of a raw socket protocol.
>
> 2) Performance assumptions
>
> Agreed, we should back claims with systematic benchmarking. We have an early gRPC prototype with preliminary results comparable to the current socket path, but we will avoid strong claims until properly benchmarked. The existing Python/Scala paths will remain, and any default switch would only happen after meeting explicit performance goals.
>
> 3) Fallback / migration strategy
>
> We will make this explicit in the SPIP. The plan is to separate the transport layer from the UDF processing logic in worker.py, allowing gRPC and socket transports to share the same execution logic. This enables safe fallback and reduces long-term dual-maintenance overhead.
>
> 4) Worker specification
>
> We do have a more detailed design and can publish it as a supporting document. The SPIP will clarify the expected structure and required metadata without going too deep into implementation detail.
>
> 5) Dependency management
>
> This will be defined in the worker specification. Each language implementation defines its dependency requirements, and clusters are expected to provision environments accordingly (as is already the case for Python today).
>
> 6) Unified query planning concerns
>
> The intent is not to force identical planning behavior across languages. The worker specification can expose metadata (e.g., pipelining support, concurrency, memory characteristics, data format constraints), allowing the planner to remain flexible and language-aware without hardcoding per-language rules.
>
> 7) Inter-UDF pipelining
>
> Pipelining is supported by the protocol design (similar to PySpark). The init message can declare multiple UDFs and define chaining and input mappings. Whether a language supports this can be expressed in worker metadata so planning can respect it.
>
> Hopefully this addresses the main concerns. I’ll update the SPIP to reflect these clarifications more explicitly.
>
> Thanks again for the thoughtful review.
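As a rough sketch of the transport/execution split described in point 3 above: the idea is that worker.py's UDF loop depends only on an abstract transport interface, so a gRPC transport and the existing socket transport can be swapped underneath the same execution logic. All names here (`Transport`, `run_worker`, the in-memory stand-in) are hypothetical illustrations, not the SPIP's or Spark's actual API.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class Transport(ABC):
    """Hypothetical transport abstraction: moves serialized batches between
    the JVM and the worker, with no knowledge of UDF execution logic."""

    @abstractmethod
    def read_batch(self) -> Optional[List[int]]:
        """Return the next input batch, or None when the stream ends."""

    @abstractmethod
    def write_batch(self, batch: List[int]) -> None:
        """Send a result batch back to the executor side."""


class InMemoryTransport(Transport):
    """Stand-in for a SocketTransport or GrpcTransport so the sketch is
    runnable without network infrastructure; real implementations would
    wrap a socket or a gRPC bidirectional stream."""

    def __init__(self, batches: List[List[int]]):
        self._inputs = iter(batches)
        self.outputs: List[List[int]] = []

    def read_batch(self) -> Optional[List[int]]:
        return next(self._inputs, None)

    def write_batch(self, batch: List[int]) -> None:
        self.outputs.append(batch)


def run_worker(transport: Transport, udf) -> None:
    """Shared execution loop: only the I/O layer varies between transports,
    which is what enables a safe fallback from gRPC to sockets."""
    while (batch := transport.read_batch()) is not None:
        transport.write_batch([udf(x) for x in batch])


transport = InMemoryTransport([[1, 2], [3]])
run_worker(transport, lambda x: x * 10)
print(transport.outputs)  # [[10, 20], [30]]
```

Because `run_worker` never touches the wire format directly, falling back from one transport to the other requires no changes to the UDF execution path, which is the dual-maintenance reduction the summary mentions.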
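Points 6 and 7 above could be made concrete along these lines: an init message declares multiple UDFs with input mappings (so later UDFs can consume earlier outputs), and worker metadata tells the planner whether pipelining is available. The field names and `UdfSpec` structure below are illustrative assumptions, not the actual protocol defined in the SPIP.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical worker metadata; field names are illustrative, not from the SPIP.
WORKER_METADATA = {
    "supports_pipelining": True,
    "max_concurrency": 4,
    "data_format": "arrow",
}


@dataclass
class UdfSpec:
    """One entry in a hypothetical init message: a UDF, the name it binds
    its output to, and the inputs it consumes (input columns or the
    outputs of earlier UDFs in the same chain)."""
    name: str
    func: Callable
    inputs: List[str]


def evaluate_pipeline(init: List[UdfSpec], row: Dict[str, int]) -> Dict[str, int]:
    """Evaluate chained UDFs in a single worker pass, feeding each UDF's
    output to later UDFs via the input mapping instead of round-tripping
    intermediate results through the JVM."""
    env = dict(row)
    for spec in init:
        env[spec.name] = spec.func(*(env[i] for i in spec.inputs))
    return env


init = [
    UdfSpec("doubled", lambda x: x * 2, inputs=["x"]),
    UdfSpec("shifted", lambda d: d + 1, inputs=["doubled"]),  # chained UDF
]
result = evaluate_pipeline(init, {"x": 5})
print(result["shifted"])  # 11
```

A planner consulting `WORKER_METADATA["supports_pipelining"]` could then emit a single chained init message for languages that support it, and fall back to one-UDF-per-worker-pass for languages that do not, without hardcoding per-language rules.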
