Thanks for the PR. I think we should follow Java and allow non-unique labels, but not provide automatic uniquification, In particular, the danger of using a counter is that one can get accidental (and potentially hard to check) off-by-one collisions. As a concrete example, imagine one partitions a dataset into two collections, each followed by a similarly-named transform.
--> B / A \ --> B Uniquification would give something like --> B / A \ --> B_2 Suppose one then realizes there's a third case to handle, giving --> B / A --> B \ --> B But this would be uniquified to --> B / A --> B_2 \ --> B_3 where the old B_2 got renamed to B_3 and a new B_2 got put in its place. This is bad because an updating runner would then attribute old B_2's state to the new B_2 (and also possibly mis-direct any inflight messages). At least with the old, intersecting names we can detect this problem rather than silently give corrupt data. On Fri, Oct 13, 2023 at 7:15 AM Joey Tran <joey.t...@schrodinger.com> wrote: > For posterity: https://github.com/apache/beam/pull/28984 > > On Tue, Oct 10, 2023 at 7:29 PM Robert Bradshaw <rober...@google.com> > wrote: > >> I would definitely support a PR making this an option. Changing the >> default would be a rather big change that would require more thought. >> >> On Tue, Oct 10, 2023 at 4:24 PM Joey Tran <joey.t...@schrodinger.com> >> wrote: >> >>> Bump on this. Sorry to pester - I'm trying to get a few teams to adopt >>> Apache Beam at my company and I'm trying to foresee parts of the API they >>> might find inconvenient. >>> >>> If there's a conclusion to make the behavior similar to java, I'm happy >>> to put up a PR >>> >>> On Thu, Oct 5, 2023, 12:49 PM Joey Tran <joey.t...@schrodinger.com> >>> wrote: >>> >>>> Is it really toggleable in Java? I imagine that if it's a toggle it'd >>>> be a very sticky toggle since it'd be easy for PTransforms to accidentally >>>> rely on it. >>>> >>>> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <rober...@google.com> >>>> wrote: >>>> >>>>> Huh. This used to be a hard error in Java, but I guess it's togglable >>>>> with an option now. We should probably add the option to toggle Python >>>>> too. >>>>> (Unclear what the default should be, but this probably ties into >>>>> re-thinking how pipeline update should work.) >>>>> >>>>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <joey.t...@schrodinger.com> >>>>> wrote: >>>>> >>>>>> Makes sense that the requirement is the same, but is the label >>>>>> auto-generation behavior the same? I modified the BeamJava >>>>>> wordcount example[1] to do the regex filter twice in a row, and unlike >>>>>> the >>>>>> BeamPython example I posted before, it just warns instead of throwing an >>>>>> exception. >>>>>> >>>>>> Tangentially, is it expected that the Beam playground examples don't >>>>>> have a way to see the outputs of a run example? I have a vague memory >>>>>> that >>>>>> there used to be a way to navigate to an output file after it's generated >>>>>> but not sure if I just dreamt that up. Playing with the examples, I >>>>>> wasn't >>>>>> positive if my runs were actually succeeding or not based on the stdout >>>>>> alone. >>>>>> >>>>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2 >>>>>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2> >>>>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW >>>>>> >>>>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user < >>>>>> u...@beam.apache.org> wrote: >>>>>> >>>>>>> BeamJava and BeamPython have the exact same behavior: >>>>>>> transform names within must be distinct [1]. This is because we do not >>>>>>> necessarily know at pipeline construction time if the pipeline will be >>>>>>> streaming or batch, or if it will be updated in the future, so the >>>>>>> decision >>>>>>> was made to impose this restriction up front. Both will auto-generate a >>>>>>> name for you if one is not given, but will do so deterministically (not >>>>>>> depending on some global context) to avoid potential update problems. >>>>>>> >>>>>>> [1] Note that this applies to the fully qualified transform name, so >>>>>>> the naming only has to be distinct within a composite transform (or at >>>>>>> the >>>>>>> top level--the pipeline itself is isomorphic to a single composite >>>>>>> transform). >>>>>>> >>>>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <joey.t...@schrodinger.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Cross posting this thread to dev@ to see if this is intentional >>>>>>>> behavior or if it's something worth changing for the python SDK >>>>>>>> >>>>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <u...@beam.apache.org> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> That suggests the default label is created as that, which indeed >>>>>>>>> causes the duplication error. >>>>>>>>> >>>>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran < >>>>>>>>> joey.t...@schrodinger.com> wrote: >>>>>>>>> >>>>>>>>>> Not sure what that suggests >>>>>>>>>> >>>>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <u...@beam.apache.org> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Looks like this is the current behaviour. If you have `t = >>>>>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as >>>>>>>>>>> `Filter(identity_filter)`. >>>>>>>>>>> >>>>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran < >>>>>>>>>>> joey.t...@schrodinger.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> You don't have to specify the names if the callable you pass in >>>>>>>>>>>> is /different/ for two `beam.Map`s, but if the callable is the >>>>>>>>>>>> same you >>>>>>>>>>>> must specify a label. For example, the below will raise an >>>>>>>>>>>> exception: >>>>>>>>>>>> >>>>>>>>>>>> ``` >>>>>>>>>>>> | beam.Filter(identity_filter) >>>>>>>>>>>> | beam.Filter(identity_filter) >>>>>>>>>>>> ``` >>>>>>>>>>>> >>>>>>>>>>>> Here's an example on playground that shows the error message >>>>>>>>>>>> you get [1]. I marked every line I added with a "# ++". >>>>>>>>>>>> >>>>>>>>>>>> It's a contrived example, but using a map or filter at the same >>>>>>>>>>>> pipeline level probably comes up often, at least in my >>>>>>>>>>>> inexperience. For >>>>>>>>>>>> example, you. might have a pipeline that partitions a pcoll into >>>>>>>>>>>> three >>>>>>>>>>>> different pcolls, runs some transforms on them, and then runs the >>>>>>>>>>>> same type >>>>>>>>>>>> of filter on each of them. >>>>>>>>>>>> >>>>>>>>>>>> The case that happens most often for me is using the >>>>>>>>>>>> `assert_that` [2] testing transform. In this case, I think often >>>>>>>>>>>> users will >>>>>>>>>>>> really have no need for a disambiguating label as they're often >>>>>>>>>>>> just >>>>>>>>>>>> writing unit tests that test a few different properties of their >>>>>>>>>>>> workflow. >>>>>>>>>>>> >>>>>>>>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW >>>>>>>>>>>> [2] >>>>>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that >>>>>>>>>>>> >>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user < >>>>>>>>>>>> u...@beam.apache.org> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> If I understand the question correctly, you don't have to >>>>>>>>>>>>> specify those names. >>>>>>>>>>>>> >>>>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you have >>>>>>>>>>>>> a stable / deterministic graph. >>>>>>>>>>>>> But in the Python SDK, you can simply use pcollection | map_fn, >>>>>>>>>>>>> instead of pcollection | 'Map' >> map_fn. >>>>>>>>>>>>> >>>>>>>>>>>>> See an example here >>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116 >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran < >>>>>>>>>>>>> joey.t...@schrodinger.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in >>>>>>>>>>>>>> place". Can you elaborate? >>>>>>>>>>>>>> >>>>>>>>>>>>>> I forgot to mention my question is posed from the context of >>>>>>>>>>>>>> a python SDK user, and afaict, there doesn't seem to be an >>>>>>>>>>>>>> obvious way to >>>>>>>>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it >>>>>>>>>>>>>> makes me >>>>>>>>>>>>>> wonder if the python SDK could support it as well though... (If >>>>>>>>>>>>>> so, I'd be >>>>>>>>>>>>>> happy to do implement it). Currently, it's fairly tedious to >>>>>>>>>>>>>> have to name >>>>>>>>>>>>>> every instance of a transform that you might reuse in a >>>>>>>>>>>>>> pipeline, e.g. when >>>>>>>>>>>>>> reapplying the same Map on different pcollections. >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user < >>>>>>>>>>>>>> u...@beam.apache.org> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Are you talking about transform names? The main reason was >>>>>>>>>>>>>>> because for runners that support updating pipelines in place, >>>>>>>>>>>>>>> the only way >>>>>>>>>>>>>>> to do so safely is if the runner can perfectly identify which >>>>>>>>>>>>>>> transforms in >>>>>>>>>>>>>>> the new graph match the ones in the old graph. There's no good >>>>>>>>>>>>>>> way to auto >>>>>>>>>>>>>>> generate names that will stay stable across updates - even >>>>>>>>>>>>>>> small changes to >>>>>>>>>>>>>>> the pipeline might change the order of nodes in the graph, >>>>>>>>>>>>>>> which could >>>>>>>>>>>>>>> result in a corrupted update. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> However, if you don't care about update, Beam can auto >>>>>>>>>>>>>>> generate these names for you! When you call PCollection.apply >>>>>>>>>>>>>>> (if using >>>>>>>>>>>>>>> BeamJava), simply omit the name parameter and Beam will auto >>>>>>>>>>>>>>> generate a >>>>>>>>>>>>>>> unique name for you. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Reuven >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran < >>>>>>>>>>>>>>> joey.t...@schrodinger.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped >>>>>>>>>>>>>>>> up from accidentally have duplicate labels from using multiple >>>>>>>>>>>>>>>> of the same >>>>>>>>>>>>>>>> transforms without labeling them. I figure this must be a >>>>>>>>>>>>>>>> common complaint, >>>>>>>>>>>>>>>> so I was just curious, what the rationale behind this design >>>>>>>>>>>>>>>> was? My naive >>>>>>>>>>>>>>>> thought off the top of my head is that it'd be more user >>>>>>>>>>>>>>>> friendly to just >>>>>>>>>>>>>>>> auto increment duplicate transforms, but I figure I must be >>>>>>>>>>>>>>>> missing >>>>>>>>>>>>>>>> something >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Cheers, >>>>>>>>>>>>>>>> Joey >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>