I would definitely support a PR making this an option. Changing the default would be a rather big change that would require more thought.
On Tue, Oct 10, 2023 at 4:24 PM Joey Tran <joey.t...@schrodinger.com> wrote: > Bump on this. Sorry to pester - I'm trying to get a few teams to adopt > Apache Beam at my company and I'm trying to foresee parts of the API they > might find inconvenient. > > If there's a conclusion to make the behavior similar to java, I'm happy to > put up a PR > > On Thu, Oct 5, 2023, 12:49 PM Joey Tran <joey.t...@schrodinger.com> wrote: > >> Is it really toggleable in Java? I imagine that if it's a toggle it'd be >> a very sticky toggle since it'd be easy for PTransforms to accidentally >> rely on it. >> >> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <rober...@google.com> >> wrote: >> >>> Huh. This used to be a hard error in Java, but I guess it's togglable >>> with an option now. We should probably add the option to toggle Python too. >>> (Unclear what the default should be, but this probably ties into >>> re-thinking how pipeline update should work.) >>> >>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <joey.t...@schrodinger.com> >>> wrote: >>> >>>> Makes sense that the requirement is the same, but is the label >>>> auto-generation behavior the same? I modified the BeamJava >>>> wordcount example[1] to do the regex filter twice in a row, and unlike the >>>> BeamPython example I posted before, it just warns instead of throwing an >>>> exception. >>>> >>>> Tangentially, is it expected that the Beam playground examples don't >>>> have a way to see the outputs of a run example? I have a vague memory that >>>> there used to be a way to navigate to an output file after it's generated >>>> but not sure if I just dreamt that up. Playing with the examples, I wasn't >>>> positive if my runs were actually succeeding or not based on the stdout >>>> alone. >>>> >>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2 >>>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2> >>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW >>>> >>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user < >>>> u...@beam.apache.org> wrote: >>>> >>>>> BeamJava and BeamPython have the exact same behavior: transform names >>>>> within must be distinct [1]. This is because we do not necessarily know at >>>>> pipeline construction time if the pipeline will be streaming or batch, or >>>>> if it will be updated in the future, so the decision was made to impose >>>>> this restriction up front. Both will auto-generate a name for you if one >>>>> is >>>>> not given, but will do so deterministically (not depending on some global >>>>> context) to avoid potential update problems. >>>>> >>>>> [1] Note that this applies to the fully qualified transform name, so >>>>> the naming only has to be distinct within a composite transform (or at the >>>>> top level--the pipeline itself is isomorphic to a single composite >>>>> transform). >>>>> >>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <joey.t...@schrodinger.com> >>>>> wrote: >>>>> >>>>>> Cross posting this thread to dev@ to see if this is intentional >>>>>> behavior or if it's something worth changing for the python SDK >>>>>> >>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <u...@beam.apache.org> >>>>>> wrote: >>>>>> >>>>>>> That suggests the default label is created as that, which indeed >>>>>>> causes the duplication error. >>>>>>> >>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <joey.t...@schrodinger.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Not sure what that suggests >>>>>>>> >>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <u...@beam.apache.org> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Looks like this is the current behaviour. If you have `t = >>>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as >>>>>>>>> `Filter(identity_filter)`. >>>>>>>>> >>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran < >>>>>>>>> joey.t...@schrodinger.com> wrote: >>>>>>>>> >>>>>>>>>> You don't have to specify the names if the callable you pass in >>>>>>>>>> is /different/ for two `beam.Map`s, but if the callable is the same >>>>>>>>>> you >>>>>>>>>> must specify a label. For example, the below will raise an exception: >>>>>>>>>> >>>>>>>>>> ``` >>>>>>>>>> | beam.Filter(identity_filter) >>>>>>>>>> | beam.Filter(identity_filter) >>>>>>>>>> ``` >>>>>>>>>> >>>>>>>>>> Here's an example on playground that shows the error message you >>>>>>>>>> get [1]. I marked every line I added with a "# ++". >>>>>>>>>> >>>>>>>>>> It's a contrived example, but using a map or filter at the same >>>>>>>>>> pipeline level probably comes up often, at least in my inexperience. >>>>>>>>>> For >>>>>>>>>> example, you. might have a pipeline that partitions a pcoll into >>>>>>>>>> three >>>>>>>>>> different pcolls, runs some transforms on them, and then runs the >>>>>>>>>> same type >>>>>>>>>> of filter on each of them. >>>>>>>>>> >>>>>>>>>> The case that happens most often for me is using the >>>>>>>>>> `assert_that` [2] testing transform. In this case, I think often >>>>>>>>>> users will >>>>>>>>>> really have no need for a disambiguating label as they're often just >>>>>>>>>> writing unit tests that test a few different properties of their >>>>>>>>>> workflow. >>>>>>>>>> >>>>>>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW >>>>>>>>>> [2] >>>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that >>>>>>>>>> >>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user < >>>>>>>>>> u...@beam.apache.org> wrote: >>>>>>>>>> >>>>>>>>>>> If I understand the question correctly, you don't have to >>>>>>>>>>> specify those names. >>>>>>>>>>> >>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you have a >>>>>>>>>>> stable / deterministic graph. >>>>>>>>>>> But in the Python SDK, you can simply use pcollection | map_fn, >>>>>>>>>>> instead of pcollection | 'Map' >> map_fn. >>>>>>>>>>> >>>>>>>>>>> See an example here >>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116 >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran < >>>>>>>>>>> joey.t...@schrodinger.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in >>>>>>>>>>>> place". Can you elaborate? >>>>>>>>>>>> >>>>>>>>>>>> I forgot to mention my question is posed from the context of a >>>>>>>>>>>> python SDK user, and afaict, there doesn't seem to be an obvious >>>>>>>>>>>> way to >>>>>>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it >>>>>>>>>>>> makes me >>>>>>>>>>>> wonder if the python SDK could support it as well though... (If >>>>>>>>>>>> so, I'd be >>>>>>>>>>>> happy to do implement it). Currently, it's fairly tedious to have >>>>>>>>>>>> to name >>>>>>>>>>>> every instance of a transform that you might reuse in a pipeline, >>>>>>>>>>>> e.g. when >>>>>>>>>>>> reapplying the same Map on different pcollections. >>>>>>>>>>>> >>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user < >>>>>>>>>>>> u...@beam.apache.org> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Are you talking about transform names? The main reason was >>>>>>>>>>>>> because for runners that support updating pipelines in place, the >>>>>>>>>>>>> only way >>>>>>>>>>>>> to do so safely is if the runner can perfectly identify which >>>>>>>>>>>>> transforms in >>>>>>>>>>>>> the new graph match the ones in the old graph. There's no good >>>>>>>>>>>>> way to auto >>>>>>>>>>>>> generate names that will stay stable across updates - even small >>>>>>>>>>>>> changes to >>>>>>>>>>>>> the pipeline might change the order of nodes in the graph, which >>>>>>>>>>>>> could >>>>>>>>>>>>> result in a corrupted update. >>>>>>>>>>>>> >>>>>>>>>>>>> However, if you don't care about update, Beam can auto >>>>>>>>>>>>> generate these names for you! When you call PCollection.apply (if >>>>>>>>>>>>> using >>>>>>>>>>>>> BeamJava), simply omit the name parameter and Beam will auto >>>>>>>>>>>>> generate a >>>>>>>>>>>>> unique name for you. >>>>>>>>>>>>> >>>>>>>>>>>>> Reuven >>>>>>>>>>>>> >>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran < >>>>>>>>>>>>> joey.t...@schrodinger.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped up >>>>>>>>>>>>>> from accidentally have duplicate labels from using multiple of >>>>>>>>>>>>>> the same >>>>>>>>>>>>>> transforms without labeling them. I figure this must be a common >>>>>>>>>>>>>> complaint, >>>>>>>>>>>>>> so I was just curious, what the rationale behind this design >>>>>>>>>>>>>> was? My naive >>>>>>>>>>>>>> thought off the top of my head is that it'd be more user >>>>>>>>>>>>>> friendly to just >>>>>>>>>>>>>> auto increment duplicate transforms, but I figure I must be >>>>>>>>>>>>>> missing >>>>>>>>>>>>>> something >>>>>>>>>>>>>> >>>>>>>>>>>>>> Cheers, >>>>>>>>>>>>>> Joey >>>>>>>>>>>>>> >>>>>>>>>>>>>