For posterity: https://github.com/apache/beam/pull/28984
On Tue, Oct 10, 2023 at 7:29 PM Robert Bradshaw <rober...@google.com> wrote: > I would definitely support a PR making this an option. Changing the > default would be a rather big change that would require more thought. > > On Tue, Oct 10, 2023 at 4:24 PM Joey Tran <joey.t...@schrodinger.com> > wrote: > >> Bump on this. Sorry to pester - I'm trying to get a few teams to adopt >> Apache Beam at my company and I'm trying to foresee parts of the API they >> might find inconvenient. >> >> If there's a conclusion to make the behavior similar to java, I'm happy >> to put up a PR >> >> On Thu, Oct 5, 2023, 12:49 PM Joey Tran <joey.t...@schrodinger.com> >> wrote: >> >>> Is it really toggleable in Java? I imagine that if it's a toggle it'd be >>> a very sticky toggle since it'd be easy for PTransforms to accidentally >>> rely on it. >>> >>> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <rober...@google.com> >>> wrote: >>> >>>> Huh. This used to be a hard error in Java, but I guess it's togglable >>>> with an option now. We should probably add the option to toggle Python too. >>>> (Unclear what the default should be, but this probably ties into >>>> re-thinking how pipeline update should work.) >>>> >>>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <joey.t...@schrodinger.com> >>>> wrote: >>>> >>>>> Makes sense that the requirement is the same, but is the label >>>>> auto-generation behavior the same? I modified the BeamJava >>>>> wordcount example[1] to do the regex filter twice in a row, and unlike the >>>>> BeamPython example I posted before, it just warns instead of throwing an >>>>> exception. >>>>> >>>>> Tangentially, is it expected that the Beam playground examples don't >>>>> have a way to see the outputs of a run example? I have a vague memory that >>>>> there used to be a way to navigate to an output file after it's generated >>>>> but not sure if I just dreamt that up. Playing with the examples, I wasn't >>>>> positive if my runs were actually succeeding or not based on the stdout >>>>> alone. >>>>> >>>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2 >>>>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2> >>>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW >>>>> >>>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user < >>>>> u...@beam.apache.org> wrote: >>>>> >>>>>> BeamJava and BeamPython have the exact same behavior: transform names >>>>>> within must be distinct [1]. This is because we do not necessarily know >>>>>> at >>>>>> pipeline construction time if the pipeline will be streaming or batch, or >>>>>> if it will be updated in the future, so the decision was made to impose >>>>>> this restriction up front. Both will auto-generate a name for you if one >>>>>> is >>>>>> not given, but will do so deterministically (not depending on some global >>>>>> context) to avoid potential update problems. >>>>>> >>>>>> [1] Note that this applies to the fully qualified transform name, so >>>>>> the naming only has to be distinct within a composite transform (or at >>>>>> the >>>>>> top level--the pipeline itself is isomorphic to a single composite >>>>>> transform). >>>>>> >>>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <joey.t...@schrodinger.com> >>>>>> wrote: >>>>>> >>>>>>> Cross posting this thread to dev@ to see if this is intentional >>>>>>> behavior or if it's something worth changing for the python SDK >>>>>>> >>>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <u...@beam.apache.org> >>>>>>> wrote: >>>>>>> >>>>>>>> That suggests the default label is created as that, which indeed >>>>>>>> causes the duplication error. >>>>>>>> >>>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <joey.t...@schrodinger.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Not sure what that suggests >>>>>>>>> >>>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <u...@beam.apache.org> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Looks like this is the current behaviour. If you have `t = >>>>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as >>>>>>>>>> `Filter(identity_filter)`. >>>>>>>>>> >>>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran < >>>>>>>>>> joey.t...@schrodinger.com> wrote: >>>>>>>>>> >>>>>>>>>>> You don't have to specify the names if the callable you pass in >>>>>>>>>>> is /different/ for two `beam.Map`s, but if the callable is the >>>>>>>>>>> same you >>>>>>>>>>> must specify a label. For example, the below will raise an >>>>>>>>>>> exception: >>>>>>>>>>> >>>>>>>>>>> ``` >>>>>>>>>>> | beam.Filter(identity_filter) >>>>>>>>>>> | beam.Filter(identity_filter) >>>>>>>>>>> ``` >>>>>>>>>>> >>>>>>>>>>> Here's an example on playground that shows the error message you >>>>>>>>>>> get [1]. I marked every line I added with a "# ++". >>>>>>>>>>> >>>>>>>>>>> It's a contrived example, but using a map or filter at the same >>>>>>>>>>> pipeline level probably comes up often, at least in my >>>>>>>>>>> inexperience. For >>>>>>>>>>> example, you. might have a pipeline that partitions a pcoll into >>>>>>>>>>> three >>>>>>>>>>> different pcolls, runs some transforms on them, and then runs the >>>>>>>>>>> same type >>>>>>>>>>> of filter on each of them. >>>>>>>>>>> >>>>>>>>>>> The case that happens most often for me is using the >>>>>>>>>>> `assert_that` [2] testing transform. In this case, I think often >>>>>>>>>>> users will >>>>>>>>>>> really have no need for a disambiguating label as they're often just >>>>>>>>>>> writing unit tests that test a few different properties of their >>>>>>>>>>> workflow. >>>>>>>>>>> >>>>>>>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW >>>>>>>>>>> [2] >>>>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that >>>>>>>>>>> >>>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user < >>>>>>>>>>> u...@beam.apache.org> wrote: >>>>>>>>>>> >>>>>>>>>>>> If I understand the question correctly, you don't have to >>>>>>>>>>>> specify those names. >>>>>>>>>>>> >>>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you have a >>>>>>>>>>>> stable / deterministic graph. >>>>>>>>>>>> But in the Python SDK, you can simply use pcollection | map_fn, >>>>>>>>>>>> instead of pcollection | 'Map' >> map_fn. >>>>>>>>>>>> >>>>>>>>>>>> See an example here >>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116 >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran < >>>>>>>>>>>> joey.t...@schrodinger.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in >>>>>>>>>>>>> place". Can you elaborate? >>>>>>>>>>>>> >>>>>>>>>>>>> I forgot to mention my question is posed from the context of a >>>>>>>>>>>>> python SDK user, and afaict, there doesn't seem to be an obvious >>>>>>>>>>>>> way to >>>>>>>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it >>>>>>>>>>>>> makes me >>>>>>>>>>>>> wonder if the python SDK could support it as well though... (If >>>>>>>>>>>>> so, I'd be >>>>>>>>>>>>> happy to do implement it). Currently, it's fairly tedious to have >>>>>>>>>>>>> to name >>>>>>>>>>>>> every instance of a transform that you might reuse in a pipeline, >>>>>>>>>>>>> e.g. when >>>>>>>>>>>>> reapplying the same Map on different pcollections. >>>>>>>>>>>>> >>>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user < >>>>>>>>>>>>> u...@beam.apache.org> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Are you talking about transform names? The main reason was >>>>>>>>>>>>>> because for runners that support updating pipelines in place, >>>>>>>>>>>>>> the only way >>>>>>>>>>>>>> to do so safely is if the runner can perfectly identify which >>>>>>>>>>>>>> transforms in >>>>>>>>>>>>>> the new graph match the ones in the old graph. There's no good >>>>>>>>>>>>>> way to auto >>>>>>>>>>>>>> generate names that will stay stable across updates - even small >>>>>>>>>>>>>> changes to >>>>>>>>>>>>>> the pipeline might change the order of nodes in the graph, which >>>>>>>>>>>>>> could >>>>>>>>>>>>>> result in a corrupted update. >>>>>>>>>>>>>> >>>>>>>>>>>>>> However, if you don't care about update, Beam can auto >>>>>>>>>>>>>> generate these names for you! When you call PCollection.apply >>>>>>>>>>>>>> (if using >>>>>>>>>>>>>> BeamJava), simply omit the name parameter and Beam will auto >>>>>>>>>>>>>> generate a >>>>>>>>>>>>>> unique name for you. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Reuven >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran < >>>>>>>>>>>>>> joey.t...@schrodinger.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped up >>>>>>>>>>>>>>> from accidentally have duplicate labels from using multiple of >>>>>>>>>>>>>>> the same >>>>>>>>>>>>>>> transforms without labeling them. I figure this must be a >>>>>>>>>>>>>>> common complaint, >>>>>>>>>>>>>>> so I was just curious, what the rationale behind this design >>>>>>>>>>>>>>> was? My naive >>>>>>>>>>>>>>> thought off the top of my head is that it'd be more user >>>>>>>>>>>>>>> friendly to just >>>>>>>>>>>>>>> auto increment duplicate transforms, but I figure I must be >>>>>>>>>>>>>>> missing >>>>>>>>>>>>>>> something >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Cheers, >>>>>>>>>>>>>>> Joey >>>>>>>>>>>>>>> >>>>>>>>>>>>>>