junaiddshaukat commented on issue #18479:
URL: https://github.com/apache/beam/issues/18479#issuecomment-3938416809

   > Hi [@je-ik](https://github.com/je-ik) and 
[@junaiddshaukat](https://github.com/junaiddshaukat) , I've been studying the 
design document for the portable Kafka Streams runner and wanted to share a few 
observations and questions.
   > 
   > I'm also interested in contributing to this project for GSoC 2026 — I've 
been contributing to the Beam TypeScript SDK (merged PRs 
[#37214](https://github.com/apache/beam/pull/37214) and 
[#37466](https://github.com/apache/beam/pull/37466)) and have been ramping up 
on the portability framework.
   > 
   > * On the Impulse bootstrap topic (Section 4.1): The `__beam_impulse` topic 
approach makes sense for single-pipeline scenarios, but I'm wondering how 
multi-pipeline isolation works. If two pipelines run concurrently, do they 
share the same topic and state store? Could this cause interference between 
pipelines, or is the applicationId sufficient to isolate state per pipeline?
   > * On watermark advancement without data (Section 12, Open Question 3): 
This is listed as an open question but I didn't see a proposed direction. The 
Flink runner uses a dedicated watermark emit thread for this — is that a viable 
approach here, or does Kafka Streams' single-threaded processor model make that 
problematic?
   > * On error handling for ExecutableStage: I didn't see anything in the 
design about what happens when the SDK harness fails to process a record. Is 
the expectation that Kafka's retry mechanism handles this, or would we need 
explicit dead letter queue handling similar to other runners?
   > * On the ValidatesRunner test scope (Section 10): The doc mentions running 
"a subset" of ValidatesRunner tests. Would it make sense to define the specific 
test classes that should pass as part of the GSoC deliverable? That would make 
the acceptance criteria clearer.
   > 
   > Happy to dig into any of these further or help with the design doc if 
useful.
   
   
   Hi @MansiSingh17, thanks for the interest in the project and the 
   thoughtful questions.
   
   Just a small correction — PRs #37214 and #37466 are my contributions 
   to the Beam TypeScript SDK. Perhaps you meant to reference different 
   PR numbers?
   
   Regarding your questions:
   
   1. **Impulse topic isolation:** Good point. The `applicationId` in 
      Kafka Streams provides namespace isolation — each application has 
      its own consumer group and state stores. For multi-pipeline 
      scenarios, each pipeline would use a unique `applicationId`, so 
      they wouldn't share state or interfere with each other. The 
      bootstrap topic name could also include the `applicationId` as 
      a prefix for additional isolation.
   
   2. **Watermark advancement without data:** The Source's reader has a 
      `getCurrentTimestamp()` method that provides timestamps independently 
      of input data. We periodically call this and update the output 
      watermark as `min(timestamp_of_all_unfinished_readers_in_task)`. 
      This was discussed with @je-ik in the design doc review.
   
   3. **Error handling for ExecutableStage:** Good question — this is 
      something we can address during implementation. The Flink runner's 
      `ExecutableStageDoFnOperator` handles SDK harness failures by 
      failing the bundle and retrying. We'd follow a similar pattern.
   
   4. **ValidatesRunner test scope:** Agreed that defining specific test 
      classes would help. We can refine this during implementation based 
      on which transforms are supported at each stage.
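   
   As a rough illustration of the watermark rule in item 2, here is a minimal 
   sketch in plain Java. `TaskWatermarkSketch`, `ReaderState`, and 
   `END_OF_INPUT` are hypothetical names made up for this example, not Beam 
   or Kafka Streams APIs, and the value returned once every reader has 
   finished is an assumption rather than something the design doc specifies.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; none of these names come from Beam or the design doc. */
class TaskWatermarkSketch {

    /** Stand-in for one source reader: its latest reported timestamp and whether it is done. */
    static final class ReaderState {
        final long timestampMillis;
        final boolean finished;

        ReaderState(long timestampMillis, boolean finished) {
            this.timestampMillis = timestampMillis;
            this.finished = finished;
        }
    }

    /** Assumed convention: once every reader has finished, nothing holds the watermark back. */
    static final long END_OF_INPUT = Long.MAX_VALUE;

    /** Minimum timestamp over the unfinished readers of a task; finished readers are ignored. */
    static long outputWatermark(List<ReaderState> readers) {
        return readers.stream()
            .filter(r -> !r.finished)
            .mapToLong(r -> r.timestampMillis)
            .min()
            .orElse(END_OF_INPUT);
    }

    public static void main(String[] args) {
        List<ReaderState> readers = Arrays.asList(
            new ReaderState(1_000L, false),
            new ReaderState(400L, false),
            new ReaderState(50L, true)); // finished, so it does not hold the watermark
        System.out.println(outputWatermark(readers)); // prints 400
    }
}
```

   A real runner would additionally need to keep the emitted watermark from 
   moving backwards between calls; the sketch leaves that out.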
   
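   For the per-pipeline isolation in item 1, the naming could look roughly 
   like the sketch below. It uses only `java.util.Properties`; 
   `"application.id"` is the Kafka Streams config key, while the `beam-` 
   prefix, the `PipelineIsolationSketch` class, and the exact topic naming 
   scheme are assumptions for illustration and are not taken from the design doc.

```java
import java.util.Properties;

/** Hypothetical sketch of per-pipeline naming; not an existing Beam class. */
class PipelineIsolationSketch {

    /** The Kafka Streams config key for the application id. */
    static final String APPLICATION_ID_KEY = "application.id";

    /** Base name of the bootstrap topic mentioned in the design doc discussion. */
    static final String IMPULSE_TOPIC_BASE = "__beam_impulse";

    /** Builds Streams properties with an application id unique to the pipeline (assumed scheme). */
    static Properties streamsProperties(String pipelineId) {
        Properties props = new Properties();
        props.setProperty(APPLICATION_ID_KEY, "beam-" + pipelineId);
        return props;
    }

    /** Prefixes the impulse topic with the application id so pipelines do not share it. */
    static String impulseTopic(Properties props) {
        return props.getProperty(APPLICATION_ID_KEY) + "-" + IMPULSE_TOPIC_BASE;
    }

    public static void main(String[] args) {
        System.out.println(impulseTopic(streamsProperties("job-a"))); // prints beam-job-a-__beam_impulse
        System.out.println(impulseTopic(streamsProperties("job-b"))); // prints beam-job-b-__beam_impulse
    }
}
```

   Prefixing with the application id follows the same pattern Kafka Streams 
   uses for its internal changelog and repartition topics.
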
   The design document has been through several review rounds with @je-ik.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
