Re: Confusing multiple output semantics in Python

Ning Kang Thu, 07 Nov 2019 14:27:41 -0800

Hi Sam,

Thanks for clarifying how outputs are accessed when building a pipeline.


Internally, we have AppliedPTransform, where the output is always a
dictionary:
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/pipeline.py#L770
It seems to me that the entry stored under the key None is the main output.
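
To make that shape concrete, here is a rough sketch in plain Python. It is
not Beam's actual code: the FakeAppliedTransform class and its methods are
made up for this illustration, and strings stand in for PCollections. It
only shows a tag-keyed dictionary where the key None holds the main output.

```python
class FakeAppliedTransform:
    """Hypothetical stand-in for an applied transform; not Beam's class."""

    def __init__(self):
        # Maps tag -> output. The key None is reserved for the main output.
        self.outputs = {}

    def add_output(self, output, tag=None):
        # Refuse to silently overwrite an output that is already registered.
        if tag in self.outputs:
            raise ValueError('duplicate output tag: %r' % (tag,))
        self.outputs[tag] = output

    @property
    def main_output(self):
        # By the convention sketched here, the main output lives under None.
        return self.outputs[None]


applied = FakeAppliedTransform()
applied.add_output('main pcollection')               # untagged, stored under None
applied.add_output('numbers pcollection', 'number')  # tagged output

print(applied.outputs[None])
print(applied.outputs['number'])
```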

Ning.

On Thu, Nov 7, 2019 at 1:39 PM Sam Rohde <[email protected]> wrote:

> Hi All,
>
> In the Python SDK there are three ways of representing the output of a
> PTransform with multiple PCollections:
>
>    - dictionary: PCollection tag --> PCollection
>    - tuple: index --> PCollection
>    - DoOutputsTuple: tag, index, or field name --> PCollection
>
> I find this inconsistent way of accessing multiple outputs to be
> confusing. Say that you have an arbitrary PTransform with multiple outputs.
> How do you know how to access an individual output without looking at the
> source code? *You can't!* Remember there are three representations of
> multiple outputs. So, you need to look at the output type and determine
> what the output actually is.
>
> What purpose does it serve to have three different ways of representing a
> single concept of multiple output PCollections?
>
> My proposal is to have a single representation analogous to Java's
> PCollectionTuple. With this new type you will be able to access PCollections
> by tag with the "[ ]" operator or by field name. It should also up-convert
> returned tuples, dicts, and DoOutputsTuples from composites into this new
> type.
>
> Full example:
>
> import apache_beam as beam
> from apache_beam import PTransform, pvalue
>
> class SomeCustomComposite(PTransform):
>   def expand(self, pcoll):
>     def my_multi_do_fn(x):
>       if isinstance(x, int):
>         yield pvalue.TaggedOutput('number', x)
>       if isinstance(x, str):
>         yield pvalue.TaggedOutput('string', x)
>
>     def printer(x):
>       print(x)
>       yield x
>
>     outputs = pcoll | beam.ParDo(my_multi_do_fn).with_outputs()
>     # pvalue.PTuple is the proposed new type; it does not exist yet.
>     return pvalue.PTuple({
>         'number': outputs.number | beam.ParDo(printer),
>         'string': outputs.string | beam.ParDo(printer)
>     })
>
> p = beam.Pipeline()
> main = p | SomeCustomComposite()
>
> # Access PCollection by field name.
> numbers = main.number | beam.ParDo(...)
>
> # Access PCollection by tag.
> strings = main['string'] | beam.ParDo(...)
>
> What do you think? Does this clear up the confusion of using multiple
> output PCollections in Python?
>
> Regards,
> Sam
>
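
For reference, the access pattern proposed above (lookup by tag with "[ ]"
or by field name, plus up-conversion of dicts and tuples) can be sketched in
plain Python. This is only an illustration under assumptions: OutputTuple is
a hypothetical name, it is not part of Beam, plain lists and strings stand in
for PCollections, and DoOutputsTuple conversion is left out.

```python
class OutputTuple:
    """Hypothetical sketch of a single multi-output container; not Beam code."""

    def __init__(self, outputs):
        if isinstance(outputs, OutputTuple):
            self._by_tag = dict(outputs._by_tag)
        elif isinstance(outputs, dict):
            # Dict outputs already map tag -> value.
            self._by_tag = dict(outputs)
        elif isinstance(outputs, (tuple, list)):
            # Positional outputs are up-converted to index -> value.
            self._by_tag = dict(enumerate(outputs))
        else:
            raise TypeError('unsupported outputs: %r' % (type(outputs),))

    def __getitem__(self, tag):
        # Access by tag (or by index for up-converted tuples).
        return self._by_tag[tag]

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails. Private names are
        # rejected first so a missing _by_tag cannot cause infinite recursion.
        if name.startswith('_'):
            raise AttributeError(name)
        try:
            return self._by_tag[name]
        except KeyError:
            raise AttributeError(name) from None

    def __iter__(self):
        # Iterating yields the outputs, so tuple unpacking keeps working.
        return iter(self._by_tag.values())

    def __len__(self):
        return len(self._by_tag)


main = OutputTuple({'number': [1, 2], 'string': ['a']})
numbers = main.number        # access by field name
strings = main['string']     # access by tag

left, right = OutputTuple(('left output', 'right output'))  # from a tuple
```

Keeping one dictionary underneath means every composite can be read the same
way, whatever its expand method originally returned.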
