BeamJava and BeamPython have the exact same behavior: transform names
within must be distinct [1]. This is because we do not necessarily know at
pipeline construction time if the pipeline will be streaming or batch, or
if it will be updated in the future, so the decision was made to impose
this restriction up front. Both will auto-generate a name for you if one is
not given, but will do so deterministically (not depending on some global
context) to avoid potential update problems.

[1] Note that this applies to the fully qualified transform name, so the
naming only has to be distinct within a composite transform (or at the top
level--the pipeline itself is isomorphic to a single composite transform).

On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <joey.t...@schrodinger.com> wrote:

> Cross posting this thread to dev@ to see if this is intentional behavior
> or if it's something worth changing for the python SDK
>
> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <u...@beam.apache.org> wrote:
>
>> That suggests the default label is created as that, which indeed causes
>> the duplication error.
>>
>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <joey.t...@schrodinger.com>
>> wrote:
>>
>>> Not sure what that suggests
>>>
>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <u...@beam.apache.org>
>>> wrote:
>>>
>>>> Looks like this is the current behaviour. If you have `t =
>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>> `Filter(identity_filter)`.
>>>>
>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <joey.t...@schrodinger.com>
>>>> wrote:
>>>>
>>>>> You don't have to specify the names if the callable you pass in is
>>>>> /different/ for two `beam.Map`s, but  if the callable is the same you must
>>>>> specify a label. For example, the below will raise an exception:
>>>>>
>>>>> ```
>>>>>         | beam.Filter(identity_filter)
>>>>>         | beam.Filter(identity_filter)
>>>>> ```
>>>>>
>>>>> Here's an example on playground that shows the error message you get
>>>>> [1]. I marked every line I added with a "# ++".
>>>>>
>>>>> It's a contrived example, but using a map or filter at the same
>>>>> pipeline level probably comes up often, at least in my inexperience. For
>>>>> example, you. might have a pipeline that partitions a pcoll into three
>>>>> different pcolls, runs some transforms on them, and then runs the same 
>>>>> type
>>>>> of filter on each of them.
>>>>>
>>>>> The case that happens most often for me is using the `assert_that` [2]
>>>>> testing transform. In this case, I think often users will really have no
>>>>> need for a disambiguating label as they're often just writing unit tests
>>>>> that test a few different properties of their workflow.
>>>>>
>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>> [2]
>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>
>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>> u...@beam.apache.org> wrote:
>>>>>
>>>>>> If I understand the question correctly, you don't have to specify
>>>>>> those names.
>>>>>>
>>>>>> As Reuven pointed out, it is probably a good idea so you have a
>>>>>> stable / deterministic graph.
>>>>>> But in the Python SDK, you can simply use pcollection | map_fn,
>>>>>> instead of pcollection | 'Map' >> map_fn.
>>>>>>
>>>>>> See an example here
>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>
>>>>>>
>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <joey.t...@schrodinger.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in place".
>>>>>>> Can you elaborate?
>>>>>>>
>>>>>>> I forgot to mention my question is posed from the context of a
>>>>>>> python SDK user, and afaict, there doesn't seem to be an obvious way to
>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it makes 
>>>>>>> me
>>>>>>> wonder if the python SDK could support it as well though... (If so, I'd 
>>>>>>> be
>>>>>>> happy to do implement it). Currently, it's fairly tedious to have to 
>>>>>>> name
>>>>>>> every instance of a transform that you might reuse in a pipeline, e.g. 
>>>>>>> when
>>>>>>> reapplying the same Map on different pcollections.
>>>>>>>
>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>> u...@beam.apache.org> wrote:
>>>>>>>
>>>>>>>> Are you talking about transform names? The main reason was because
>>>>>>>> for runners that support updating pipelines in place, the only way to 
>>>>>>>> do so
>>>>>>>> safely is if the runner can perfectly identify which transforms in the 
>>>>>>>> new
>>>>>>>> graph match the ones in the old graph. There's no good way to auto 
>>>>>>>> generate
>>>>>>>> names that will stay stable across updates - even small changes to the
>>>>>>>> pipeline might change the order of nodes in the graph, which could 
>>>>>>>> result
>>>>>>>> in a corrupted update.
>>>>>>>>
>>>>>>>> However, if you don't care about update, Beam can auto generate
>>>>>>>> these names for you! When you call PCollection.apply (if using 
>>>>>>>> BeamJava),
>>>>>>>> simply omit the name parameter and Beam will auto generate a unique 
>>>>>>>> name
>>>>>>>> for you.
>>>>>>>>
>>>>>>>> Reuven
>>>>>>>>
>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>> joey.t...@schrodinger.com> wrote:
>>>>>>>>
>>>>>>>>> After writing a few pipelines now, I keep getting tripped up from
>>>>>>>>> accidentally have duplicate labels from using multiple of the same
>>>>>>>>> transforms without labeling them. I figure this must be a common 
>>>>>>>>> complaint,
>>>>>>>>> so I was just curious, what the rationale behind this design was? My 
>>>>>>>>> naive
>>>>>>>>> thought off the top of my head is that it'd be more user friendly to 
>>>>>>>>> just
>>>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>>>> something
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Joey
>>>>>>>>>
>>>>>>>>

Reply via email to