mynameborat commented on PR #26287: URL: https://github.com/apache/beam/pull/26287#issuecomment-1511741941
Overall, I feel the PR doesn't address/improve much rather introduces more confusion. Here are the reasons > A new element is always added, but a bundle is not always started. The name `tryStartBundle` already reflects this although in retrospection, I feel this isn't something we should necessarily cascade through names rather through API behavior. e.g., it should `start` with semantics calling out as idempotent behavior within a bundle while lifecycle start for every new bundle. Additionally, the symmetry of `tryFinishBundle` to `countNewElement` doesn't read well. What does it mean for the caller to invoke `tryFinishBundle` when there is nothing that conveys the caller to start one. >The only usage of this method is in DoFnOp.processElement, it reads better to context. Not sure I agree with this as well. `DoFnOp.processElement` is a bridge between samza operator and beam ParDo and directly interacts with `PushbackSideInputDoFnRunner`. The sequence of `PushbackSideInputDoFnRunner` is `startBundle`, many `processElementInReadyWindows` and `finishBundle` which expects the callers to keep track of bundle boundaries and invoke these methods appropriately. So it very much reads with the context of `tryStartBundle` --> `startBundle`, `processElementInReadyWindows` and `tryFinishBundle` --> `finishBundle`. All said, I think there is scope for refactor and improvements as I see. e.g., The existing bundle manager does two things 1. Handle bundle management (track bundles, invokes start and finish lifecycle method of the underlying runner) 2. Process messages, handle watermark and process timers. It is the way it is because we don't wait until we create a complete bundle and delegate it to the underlying runner as and when we receive elements. Due to this behavior, 1 & 2 exists together. It is possible to separate them out and have the `BundleProcessor` which handles 2) and is capable of handling early delegation (handle uncommitted bundles) vs lazy delegation (end of a fully formed bundle) and potentially handle watermark and timers as well depending on which model it runs. All our use cases (classic & portability) uses the early delegation strategy and hence not critical to do the above mentioned refactor. There might be benefits in doing above if we plan to have multiple bundles in which case the bundle management might become a bit heavy and so does 2). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
