junaiddshaukat commented on issue #18479:
URL: https://github.com/apache/beam/issues/18479#issuecomment-3939228313

   > [@junaiddshaukat](https://github.com/junaiddshaukat) Hi 
[@junaiddshaukat](https://github.com/junaiddshaukat), thanks for the correction 
— I apologize for the mix-up. My contributions are actually to the Beam Python 
SDK: [#37672](https://github.com/apache/beam/pull/37672) (fix GroupBy snippet 
tests and re-enable skipped assertions) and 
[#37674](https://github.com/apache/beam/pull/37674) (make GCS filesystem lookup 
lazy to match S3 behavior). I shouldn't have referenced TypeScript SDK PRs in 
my original comment. Thank you for the detailed answers. A couple of follow-up 
thoughts:
   > 
   > * On watermark advancement: For partitions that are idle with no new 
records arriving, would we still need a periodic wall-clock punctuation to 
trigger the watermark check? Or does the source reader advance independently of 
data flow?
   > * On error handling: Since we have manual commit control, failed bundles 
could potentially replay from the last committed offset rather than failing the 
whole job. Worth thinking about early given it affects how bundle boundaries 
are designed.
   > 
   > I'm happy to contribute to any part of this project. Is there an aspect of 
the implementation where an additional contributor would be helpful?
   
   On the follow-up questions:
   
   **Watermark with idle partitions:** Yes, we'd need a periodic 
   wall-clock punctuation to check source readers even when no data 
   arrives. The source reader's `getWatermark()` reports the 
   watermark independently of data flow (`getCurrentTimestamp()` only 
   gives the timestamp of the current record), but we need to poll it 
   periodically. A Kafka Streams punctuator scheduled with 
   `PunctuationType.WALL_CLOCK_TIME` can handle this, since it fires 
   on wall-clock time regardless of stream-time advancement.
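
   To make the polling idea above concrete, here is a minimal sketch in plain Java with no Beam or Kafka dependencies. `IdleWatermarkPoller`, `register`, and `onWallClockTick` are hypothetical names chosen for illustration. The assumption is that a real runner would call `onWallClockTick()` from a wall-clock punctuator and would wrap each partition's source reader in the supplier.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.OptionalLong;
import java.util.function.LongSupplier;

/** Hypothetical helper for illustration; not part of Beam or Kafka Streams. */
final class IdleWatermarkPoller {
    // One watermark supplier per partition; in a runner each would wrap a source reader.
    private final Map<String, LongSupplier> readers = new LinkedHashMap<>();
    private long lastEmitted = Long.MIN_VALUE;

    void register(String partition, LongSupplier watermarkMillis) {
        readers.put(partition, watermarkMillis);
    }

    /**
     * Meant to run on every wall-clock tick, whether or not records arrived.
     * Returns the new output watermark only if it advanced since the last tick.
     */
    OptionalLong onWallClockTick() {
        if (readers.isEmpty()) {
            return OptionalLong.empty();
        }
        // The output watermark is held back by the slowest partition.
        long min = Long.MAX_VALUE;
        for (LongSupplier reader : readers.values()) {
            min = Math.min(min, reader.getAsLong());
        }
        if (min > lastEmitted) {
            lastEmitted = min;
            return OptionalLong.of(min);
        }
        return OptionalLong.empty();
    }
}
```

   Taking the minimum across partitions is one possible policy; it means an idle partition holds the output watermark back until its reader reports progress, which is exactly why the tick has to happen without data.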
   
   **Error handling / bundle replay:** Good point. With exactly-once 
   processing and manual commit, a failed bundle means the transaction 
   is aborted and its offsets aren't committed. On recovery, Kafka 
   Streams replays from the last committed offset, effectively 
   retrying the failed bundle. This aligns with Beam's bundle retry 
   semantics. We should document this in the design doc.
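
   The replay behaviour described above can be sketched with a small in-memory model. This is plain Java, not Kafka Streams code: `BundleReplaySketch` and its methods are hypothetical, the list stands in for a partition, and the "transaction" is just a staged output buffer that is published together with the offset advance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical in-memory model of commit-on-success bundle processing. */
final class BundleReplaySketch {
    private final List<String> log;   // stand-in for a partition's records
    private int committedOffset = 0;  // next offset to read after recovery

    BundleReplaySketch(List<String> log) {
        this.log = log;
    }

    int committedOffset() {
        return committedOffset;
    }

    /**
     * Processes up to bundleSize records starting at the committed offset.
     * Output is staged and published only if the whole bundle succeeds;
     * on failure nothing is published and the offset is left untouched,
     * so the next call replays the same records.
     */
    boolean processBundle(int bundleSize, Function<String, String> fn, List<String> sink) {
        int end = Math.min(committedOffset + bundleSize, log.size());
        List<String> staged = new ArrayList<>();
        try {
            for (int i = committedOffset; i < end; i++) {
                staged.add(fn.apply(log.get(i)));
            }
        } catch (RuntimeException e) {
            return false; // "abort": staged output is discarded
        }
        sink.addAll(staged);   // "commit": publish output...
        committedOffset = end; // ...and advance the offset together
        return true;
    }
}
```

   In this model the bundle boundary is simply whatever range sits between two commits, which is why the choice of commit points and the bundle design are tied together.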
   
   Regarding scope — if GSoC allows only one contributor per project, 
   the current scope (Read, ParDo, GBK, Combine, Window, Flatten) is 
   well-sized for a 175-hour medium project. Stateful ParDo and 
   splittable DoFn are already listed as stretch goals and could be 
   follow-up work after GSoC. That said, additional contributors are 
   always welcome for parallel efforts like documentation or testing 
   infrastructure.


