junaiddshaukat commented on issue #18479: URL: https://github.com/apache/beam/issues/18479#issuecomment-3938416809
> Hi [@je-ik](https://github.com/je-ik) and [@junaiddshaukat](https://github.com/junaiddshaukat) , I've been studying the design document for the portable Kafka Streams runner and wanted to share a few observations and questions. > > I'm also interested in contributing to this project for GSoC 2026 — I've been contributing to the Beam TypeScript SDK (merged PRs [#37214](https://github.com/apache/beam/pull/37214) and [#37466](https://github.com/apache/beam/pull/37466)) and have been ramping up on the portability framework. > > * On the Impulse bootstrap topic (Section 4.1): The `__beam_impulse` topic approach makes sense for single-pipeline scenarios, but I'm wondering how multi-pipeline isolation works. If two pipelines run concurrently, do they share the same topic and state store? Could this cause interference between pipelines, or is the applicationId sufficient to isolate state per pipeline? > * On watermark advancement without data (Section 12, Open Question 3): This is listed as an open question but I didn't see a proposed direction. The Flink runner uses a dedicated watermark emit thread for this — is that a viable approach here, or does Kafka Streams' single-threaded processor model make that problematic? > * On error handling for ExecutableStage: I didn't see anything in the design about what happens when the SDK harness fails to process a record. Is the expectation that Kafka's retry mechanism handles this, or would we need explicit dead letter queue handling similar to other runners? > * On the ValidatesRunner test scope (Section 10): The doc mentions running "a subset" of ValidatesRunner tests, would it make sense to define the specific test classes that should pass as part of the GSoC deliverable? That would make the acceptance criteria clearer. > > Happy to dig into any of these further or help with the design doc if useful. Hi @MansiSingh17, thanks for the interest in the project and the thoughtful questions. Just a small correction — PRs #37214 and #37466 are my contributions to the Beam TypeScript SDK. Perhaps you meant to reference different PR numbers? Regarding your questions: 1. **Impulse topic isolation:** Good point. The `applicationId` in Kafka Streams provides namespace isolation — each application has its own consumer group and state stores. For multi-pipeline scenarios, each pipeline would use a unique `applicationId`, so they wouldn't share state or interfere with each other. The bootstrap topic name could also include the `applicationId` as a prefix for additional isolation. 2. **Watermark advancement without data:** The Source's reader has a `getCurrentTimestamp()` method that provides timestamps independently of input data. We periodically call this and update the output watermark as `min(timestamp_of_all_unfinished_readers_in_task)`. This was discussed with @je-ik in the design doc review. 3. **Error handling for ExecutableStage:** Good question — this is something we can address during implementation. The Flink runner's `ExecutableStageDoFnOperator` handles SDK harness failures by failing the bundle and retrying. We'd follow a similar pattern. 4. **ValidatesRunner test scope:** Agreed that defining specific test classes would help. We can refine this during implementation based on which transforms are supported at each stage. The design document has been through several review rounds with @je-ik. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
