Beam Python expects DoFns to return an iterable that contains the actual
output elements. This is documented, and visible in examples, but it is
also a bit counter-intuitive.

We should definitely add a check in _OutputProcessor[1] to raise a more
descriptive error if it receives a non-iterable.
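
To make the idea concrete, here is a rough sketch of such a check in plain
Python. This is not the actual _OutputProcessor code: the helper name
check_dofn_result and the error wording are hypothetical, and I am assuming
that a None return should keep meaning "no output".

```python
from collections.abc import Iterable


def check_dofn_result(result):
    """Hypothetical helper: validate what a DoFn's process() returned."""
    # Assumption: None means "this call produced no output elements".
    if result is None:
        return []
    # Anything without __iter__ (an int, a float, ...) cannot be unpacked
    # into output elements, so fail early with a message that says why.
    if not isinstance(result, Iterable):
        raise TypeError(
            "DoFn returned %r (type %s), but an iterable of output "
            "elements was expected. Did you mean to return [%r]?"
            % (result, type(result).__name__, result))
    return result
```

With something like this, a DoFn that returns 5 would fail with a message
pointing at the missing list, instead of a generic "'int' object is not
iterable" raised from deep inside the runner.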

Should Beam also raise an error if a DoFn returns a string? For example,
consider the following pipeline:
p | Create(['abc']) | ParDo(lambda x: x) | WriteToFile('myfile')

This pipeline would write three separate elements ('a', 'b', and 'c')
rather than the single string 'abc'. Isn't that a bit awkward?

Raising an error when a string is returned would be the least surprising
behavior for users, rather than having their strings silently broken up
into single-character elements.
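
Here is a small plain-Python sketch of the difference. It does not use Beam
at all: emit_outputs is a hypothetical stand-in for whatever consumes a
DoFn's return value, and the reject_strings flag is made up purely to
illustrate the proposal.

```python
def emit_outputs(result, reject_strings=False):
    """Hypothetical stand-in for how a DoFn's return value gets unpacked."""
    # Proposed behavior: a bare string is almost certainly a mistake,
    # so refuse it instead of iterating over its characters.
    if reject_strings and isinstance(result, (str, bytes)):
        raise TypeError(
            "DoFn returned the string %r directly; wrap it in a list "
            "(e.g. [%r]) to emit it as a single element." % (result, result))
    # Current behavior: every item of the iterable becomes one element.
    return list(result)


identity = lambda x: x

# Today: the string is iterated, producing one element per character.
split_up = emit_outputs(identity('abc'))      # ['a', 'b', 'c']

# What the user most likely wanted: return a one-element list.
wrapped = emit_outputs([identity('abc')])     # ['abc']
```

Calling emit_outputs(identity('abc'), reject_strings=True) raises the
TypeError above rather than quietly producing three elements.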

The downside is that some users may already rely on the current behavior,
so this would be a breaking change. But I think it's still worth discussing.

Best
-P.

[1]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/common.py#L659
-- 
Got feedback? go/pabloem-feedback
