chrisqiqiu commented on PR #38429: URL: https://github.com/apache/beam/pull/38429#issuecomment-4886486292
Thanks for the feedback — the per-element cost is a fair point. Here's how I'd address it: 1. Skip DoFns that use "yield". A process method that uses "yield" is a generator. calling it returns a generator object, which can never be a str/bytes/dict. So the bug is impossible for these. I can detect this at construction time with inspect.isgeneratorfunction() and skip the check entirely. Since "yield" is the common style, this removes the cost for most DoFns. 2. For DoFns that use "return", only check the first element. As @jrmccluskey pointed out, this bug is deterministic. if a DoFn returns a str once, it does so every time, so it's caught on the first element. I can check once, then set a flag to skip all later elements. That makes it a one-time check per DoFn instead of per element. I looked at doing this fully at construction time, but the bug depends on the actual value returned by "return" at runtime, so it can't be detected statically without relying more on trivial inference (which @jrmccluskey wanted to avoid). Does this approach sound good? If so, I'll update the PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
