junaiddshaukat commented on issue #18479: URL: https://github.com/apache/beam/issues/18479#issuecomment-3939228313
> Hi [@junaiddshaukat](https://github.com/junaiddshaukat), thanks for the correction. I apologize for the mix-up. My contributions are actually to the Beam Python SDK: [#37672](https://github.com/apache/beam/pull/37672) (fix GroupBy snippet tests and re-enable skipped assertions) and [#37674](https://github.com/apache/beam/pull/37674) (make GCS filesystem lookup lazy to match S3 behavior). I shouldn't have referenced TypeScript SDK PRs in my original comment. Thank you for the detailed answers. A couple of follow-up thoughts:
>
> * On watermark advancement: For partitions that are idle with no new records arriving, would we still need a periodic wall-clock punctuation to trigger the watermark check? Or does the source reader advance independently of data flow?
> * On error handling: Since we have manual commit control, failed bundles could potentially replay from the last committed offset rather than failing the whole job. Worth thinking about early given it affects how bundle boundaries are designed.
>
> I'm happy to contribute to any part of this project. Is there an aspect of the implementation where an additional contributor would be helpful?

On the follow-up questions:

**Watermark with idle partitions:** Yes, we'd need a periodic wall-clock punctuation to check source readers even when no data arrives. The source reader's `getWatermark()` provides the watermark independently of data flow (`getCurrentTimestamp()` only reports the timestamp of the current record), but we need to poll it periodically. KS wall-clock punctuation (a `Punctuator` scheduled with `PunctuationType.WALL_CLOCK_TIME`) can handle this, since it fires independently of stream-time advancement.

**Error handling / bundle replay:** Good point. With exactly-once and manual commit, a failed bundle means the transaction is aborted and offsets aren't committed. KS will replay from the last committed offset on recovery, effectively retrying the failed bundle. This aligns with Beam's bundle retry semantics. We should document this in the design doc.

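To make the idle-partition point concrete, here is a minimal, dependency-free sketch of the polling idea. It is not taken from the design doc or from any existing runner code: the class name `IdlePartitionWatermarkPoller` and its methods are hypothetical, each `LongSupplier` merely stands in for a per-partition reader's watermark accessor, and taking the minimum across partitions is an assumption about how the output watermark would be combined.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

/** Hypothetical sketch; not part of Beam or Kafka Streams. */
public class IdlePartitionWatermarkPoller {
    // One supplier per partition; each stands in for a reader's watermark accessor.
    private final List<LongSupplier> readerWatermarks = new ArrayList<>();
    private long outputWatermark = Long.MIN_VALUE;

    public void addReader(LongSupplier watermark) {
        readerWatermarks.add(watermark);
    }

    /**
     * Intended to be called on every wall-clock tick, whether or not records arrived.
     * Returns true if the output watermark advanced.
     */
    public boolean onWallClockTick() {
        if (readerWatermarks.isEmpty()) {
            return false;
        }
        // Assumption: the output watermark is the minimum over all partitions.
        long min = Long.MAX_VALUE;
        for (LongSupplier reader : readerWatermarks) {
            min = Math.min(min, reader.getAsLong());
        }
        // Watermarks never move backwards, so only accept a strictly larger value.
        if (min > outputWatermark) {
            outputWatermark = min;
            return true;
        }
        return false;
    }

    public long outputWatermark() {
        return outputWatermark;
    }
}
```

In an actual runner, `onWallClockTick()` would presumably be invoked from a Kafka Streams `Punctuator` registered through `ProcessorContext.schedule(...)` with `PunctuationType.WALL_CLOCK_TIME`, so the check runs even when a partition delivers no records.
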
Regarding scope: if GSoC allows only one contributor per project, the current scope (Read, ParDo, GBK, Combine, Window, Flatten) is well-sized for a 175-hour medium project. Stateful ParDo and splittable DoFn are already listed as stretch goals and could be follow-up work after GSoC. That said, additional contributors are always welcome for parallel efforts like documentation or testing infrastructure.
