[GitHub] [beam] mynameborat commented on pull request #26287: Remane tryStartBundle to countNewElement.

via GitHub Mon, 17 Apr 2023 09:56:09 -0700


mynameborat commented on PR #26287:
URL: https://github.com/apache/beam/pull/26287#issuecomment-1511741941


   Overall, I feel the PR doesn't address/improve much rather introduces more 
confusion. Here are the reasons
   
   > A new element is always added, but a bundle is not always started.
   The name `tryStartBundle` already reflects this although in retrospection, I 
feel this isn't something we should necessarily cascade through names rather 
through API behavior. e.g., it should `start` with semantics calling out as 
idempotent behavior within a bundle while lifecycle start for every new bundle.
   
   Additionally, the symmetry of `tryFinishBundle` to `countNewElement` doesn't 
read well. What does it mean for the caller to invoke `tryFinishBundle` when 
there is nothing that conveys the caller to start one. 
   
   >The only usage of this method is in DoFnOp.processElement, it reads better 
to context.
   
   Not sure I agree with this as well. `DoFnOp.processElement` is a bridge 
between samza operator and beam ParDo and directly interacts with 
`PushbackSideInputDoFnRunner`. The sequence of `PushbackSideInputDoFnRunner` is 
`startBundle`, many `processElementInReadyWindows` and `finishBundle` which 
expects the callers to keep track of bundle boundaries and invoke these methods 
appropriately. 
   
   So it very much reads with the context of `tryStartBundle` --> 
`startBundle`, `processElementInReadyWindows` and `tryFinishBundle` --> 
`finishBundle`. 
   
   All said, I think there is scope for refactor and improvements as I see. 
e.g., 
   
   The existing bundle manager does two things 
   1. Handle bundle management (track bundles, invokes start and finish 
lifecycle method of the underlying runner)
   2. Process messages, handle watermark and process timers.
   
   It is the way it is because we don't wait until we create a complete bundle 
and delegate it to the underlying runner as and when we receive elements. Due 
to this behavior, 1 & 2 exists together. It is possible to separate them out 
and have the `BundleProcessor` which handles 2) and is capable of handling 
early delegation (handle uncommitted bundles) vs lazy delegation (end of a 
fully formed bundle) and potentially handle watermark and timers as well 
depending on which model it runs.
   
   All our use cases (classic & portability) uses the early delegation strategy 
and hence not critical to do the above mentioned refactor. There might be 
benefits in doing above if we plan to have multiple bundles in which case the 
bundle management might become a bit heavy and so does 2). 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [beam] mynameborat commented on pull request #26287: Remane tryStartBundle to countNewElement.

Reply via email to