Cross posting this thread to dev@ to see if this is intentional behavior or if it's something worth changing for the python SDK
On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <u...@beam.apache.org> wrote: > That suggests the default label is created as that, which indeed causes > the duplication error. > > On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <joey.t...@schrodinger.com> > wrote: > >> Not sure what that suggests >> >> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <u...@beam.apache.org> wrote: >> >>> Looks like this is the current behaviour. If you have `t = >>> beam.Filter(identity_filter)`, `t.label` is defined as >>> `Filter(identity_filter)`. >>> >>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <joey.t...@schrodinger.com> >>> wrote: >>> >>>> You don't have to specify the names if the callable you pass in is >>>> /different/ for two `beam.Map`s, but if the callable is the same you must >>>> specify a label. For example, the below will raise an exception: >>>> >>>> ``` >>>> | beam.Filter(identity_filter) >>>> | beam.Filter(identity_filter) >>>> ``` >>>> >>>> Here's an example on playground that shows the error message you get >>>> [1]. I marked every line I added with a "# ++". >>>> >>>> It's a contrived example, but using a map or filter at the same >>>> pipeline level probably comes up often, at least in my inexperience. For >>>> example, you. might have a pipeline that partitions a pcoll into three >>>> different pcolls, runs some transforms on them, and then runs the same type >>>> of filter on each of them. >>>> >>>> The case that happens most often for me is using the `assert_that` [2] >>>> testing transform. In this case, I think often users will really have no >>>> need for a disambiguating label as they're often just writing unit tests >>>> that test a few different properties of their workflow. >>>> >>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW >>>> [2] >>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that >>>> >>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user < >>>> u...@beam.apache.org> wrote: >>>> >>>>> If I understand the question correctly, you don't have to specify >>>>> those names. >>>>> >>>>> As Reuven pointed out, it is probably a good idea so you have a stable >>>>> / deterministic graph. >>>>> But in the Python SDK, you can simply use pcollection | map_fn, >>>>> instead of pcollection | 'Map' >> map_fn. >>>>> >>>>> See an example here >>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116 >>>>> >>>>> >>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <joey.t...@schrodinger.com> >>>>> wrote: >>>>> >>>>>> Hmm, I'm not sure what you mean by "updating pipelines in place". Can >>>>>> you elaborate? >>>>>> >>>>>> I forgot to mention my question is posed from the context of a python >>>>>> SDK user, and afaict, there doesn't seem to be an obvious way to >>>>>> autogenerate names/labels. Hearing that the java SDK supports it makes me >>>>>> wonder if the python SDK could support it as well though... (If so, I'd >>>>>> be >>>>>> happy to do implement it). Currently, it's fairly tedious to have to name >>>>>> every instance of a transform that you might reuse in a pipeline, e.g. >>>>>> when >>>>>> reapplying the same Map on different pcollections. >>>>>> >>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user < >>>>>> u...@beam.apache.org> wrote: >>>>>> >>>>>>> Are you talking about transform names? The main reason was because >>>>>>> for runners that support updating pipelines in place, the only way to >>>>>>> do so >>>>>>> safely is if the runner can perfectly identify which transforms in the >>>>>>> new >>>>>>> graph match the ones in the old graph. There's no good way to auto >>>>>>> generate >>>>>>> names that will stay stable across updates - even small changes to the >>>>>>> pipeline might change the order of nodes in the graph, which could >>>>>>> result >>>>>>> in a corrupted update. >>>>>>> >>>>>>> However, if you don't care about update, Beam can auto generate >>>>>>> these names for you! When you call PCollection.apply (if using >>>>>>> BeamJava), >>>>>>> simply omit the name parameter and Beam will auto generate a unique name >>>>>>> for you. >>>>>>> >>>>>>> Reuven >>>>>>> >>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran < >>>>>>> joey.t...@schrodinger.com> wrote: >>>>>>> >>>>>>>> After writing a few pipelines now, I keep getting tripped up from >>>>>>>> accidentally have duplicate labels from using multiple of the same >>>>>>>> transforms without labeling them. I figure this must be a common >>>>>>>> complaint, >>>>>>>> so I was just curious, what the rationale behind this design was? My >>>>>>>> naive >>>>>>>> thought off the top of my head is that it'd be more user friendly to >>>>>>>> just >>>>>>>> auto increment duplicate transforms, but I figure I must be missing >>>>>>>> something >>>>>>>> >>>>>>>> Cheers, >>>>>>>> Joey >>>>>>>> >>>>>>>