Thanks for the PR.

I think we should follow Java and allow non-unique labels, but not provide
automatic uniquification, In particular, the danger of using a counter is
that one can get accidental (and potentially hard to check) off-by-one
collisions. As a concrete example, imagine one partitions a dataset into
two collections, each followed by a similarly-named transform.

    --> B
  /
A
 \
   --> B

Uniquification would give something like

    --> B
  /
A
 \
   --> B_2

Suppose one then realizes there's a third case to handle, giving

    --> B
  /
A --> B
 \
   --> B

But this would be uniquified to

    --> B
  /
A --> B_2
 \
   --> B_3

where the old B_2 got renamed to B_3 and a new B_2 got put in its place.
This is bad because an updating runner would then attribute old B_2's state
to the new B_2 (and also possibly mis-direct any inflight messages). At
least with the old, intersecting names we can detect this problem
rather than silently give corrupt data.


On Fri, Oct 13, 2023 at 7:15 AM Joey Tran <joey.t...@schrodinger.com> wrote:

> For posterity: https://github.com/apache/beam/pull/28984
>
> On Tue, Oct 10, 2023 at 7:29 PM Robert Bradshaw <rober...@google.com>
> wrote:
>
>> I would definitely support a PR making this an option. Changing the
>> default would be a rather big change that would require more thought.
>>
>> On Tue, Oct 10, 2023 at 4:24 PM Joey Tran <joey.t...@schrodinger.com>
>> wrote:
>>
>>> Bump on this. Sorry to pester - I'm trying to get a few teams to adopt
>>> Apache Beam at my company and I'm trying to foresee parts of the API they
>>> might find inconvenient.
>>>
>>> If there's a conclusion to make the behavior similar to java, I'm happy
>>> to put up a PR
>>>
>>> On Thu, Oct 5, 2023, 12:49 PM Joey Tran <joey.t...@schrodinger.com>
>>> wrote:
>>>
>>>> Is it really toggleable in Java? I imagine that if it's a toggle it'd
>>>> be a very sticky toggle since it'd be easy for PTransforms to accidentally
>>>> rely on it.
>>>>
>>>> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <rober...@google.com>
>>>> wrote:
>>>>
>>>>> Huh. This used to be a hard error in Java, but I guess it's togglable
>>>>> with an option now. We should probably add the option to toggle Python 
>>>>> too.
>>>>> (Unclear what the default should be, but this probably ties into
>>>>> re-thinking how pipeline update should work.)
>>>>>
>>>>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <joey.t...@schrodinger.com>
>>>>> wrote:
>>>>>
>>>>>> Makes sense that the requirement is the same, but is the label
>>>>>> auto-generation behavior the same? I modified the BeamJava
>>>>>> wordcount example[1] to do the regex filter twice in a row, and unlike 
>>>>>> the
>>>>>> BeamPython example I posted before, it just warns instead of throwing an
>>>>>> exception.
>>>>>>
>>>>>> Tangentially, is it expected that the Beam playground examples don't
>>>>>> have a way to see the outputs of a run example? I have a vague memory 
>>>>>> that
>>>>>> there used to be a way to navigate to an output file after it's generated
>>>>>> but not sure if I just dreamt that up. Playing with the examples, I 
>>>>>> wasn't
>>>>>> positive if my runs were actually succeeding or not based on the stdout
>>>>>> alone.
>>>>>>
>>>>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>>>>>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
>>>>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>
>>>>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
>>>>>> u...@beam.apache.org> wrote:
>>>>>>
>>>>>>> BeamJava and BeamPython have the exact same behavior:
>>>>>>> transform names within must be distinct [1]. This is because we do not
>>>>>>> necessarily know at pipeline construction time if the pipeline will be
>>>>>>> streaming or batch, or if it will be updated in the future, so the 
>>>>>>> decision
>>>>>>> was made to impose this restriction up front. Both will auto-generate a
>>>>>>> name for you if one is not given, but will do so deterministically (not
>>>>>>> depending on some global context) to avoid potential update problems.
>>>>>>>
>>>>>>> [1] Note that this applies to the fully qualified transform name, so
>>>>>>> the naming only has to be distinct within a composite transform (or at 
>>>>>>> the
>>>>>>> top level--the pipeline itself is isomorphic to a single composite
>>>>>>> transform).
>>>>>>>
>>>>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <joey.t...@schrodinger.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Cross posting this thread to dev@ to see if this is intentional
>>>>>>>> behavior or if it's something worth changing for the python SDK
>>>>>>>>
>>>>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <u...@beam.apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> That suggests the default label is created as that, which indeed
>>>>>>>>> causes the duplication error.
>>>>>>>>>
>>>>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <
>>>>>>>>> joey.t...@schrodinger.com> wrote:
>>>>>>>>>
>>>>>>>>>> Not sure what that suggests
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <u...@beam.apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Looks like this is the current behaviour. If you have `t =
>>>>>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>>>>>>>> `Filter(identity_filter)`.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <
>>>>>>>>>>> joey.t...@schrodinger.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> You don't have to specify the names if the callable you pass in
>>>>>>>>>>>> is /different/ for two `beam.Map`s, but  if the callable is the 
>>>>>>>>>>>> same you
>>>>>>>>>>>> must specify a label. For example, the below will raise an 
>>>>>>>>>>>> exception:
>>>>>>>>>>>>
>>>>>>>>>>>> ```
>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>> ```
>>>>>>>>>>>>
>>>>>>>>>>>> Here's an example on playground that shows the error message
>>>>>>>>>>>> you get [1]. I marked every line I added with a "# ++".
>>>>>>>>>>>>
>>>>>>>>>>>> It's a contrived example, but using a map or filter at the same
>>>>>>>>>>>> pipeline level probably comes up often, at least in my 
>>>>>>>>>>>> inexperience. For
>>>>>>>>>>>> example, you. might have a pipeline that partitions a pcoll into 
>>>>>>>>>>>> three
>>>>>>>>>>>> different pcolls, runs some transforms on them, and then runs the 
>>>>>>>>>>>> same type
>>>>>>>>>>>> of filter on each of them.
>>>>>>>>>>>>
>>>>>>>>>>>> The case that happens most often for me is using the
>>>>>>>>>>>> `assert_that` [2] testing transform. In this case, I think often 
>>>>>>>>>>>> users will
>>>>>>>>>>>> really have no need for a disambiguating label as they're often 
>>>>>>>>>>>> just
>>>>>>>>>>>> writing unit tests that test a few different properties of their 
>>>>>>>>>>>> workflow.
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>>>> [2]
>>>>>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>>>>>>>> u...@beam.apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> If I understand the question correctly, you don't have to
>>>>>>>>>>>>> specify those names.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you have
>>>>>>>>>>>>> a stable / deterministic graph.
>>>>>>>>>>>>> But in the Python SDK, you can simply use pcollection | map_fn,
>>>>>>>>>>>>> instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>>>>>
>>>>>>>>>>>>> See an example here
>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <
>>>>>>>>>>>>> joey.t...@schrodinger.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in
>>>>>>>>>>>>>> place". Can you elaborate?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I forgot to mention my question is posed from the context of
>>>>>>>>>>>>>> a python SDK user, and afaict, there doesn't seem to be an 
>>>>>>>>>>>>>> obvious way to
>>>>>>>>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it 
>>>>>>>>>>>>>> makes me
>>>>>>>>>>>>>> wonder if the python SDK could support it as well though... (If 
>>>>>>>>>>>>>> so, I'd be
>>>>>>>>>>>>>> happy to do implement it). Currently, it's fairly tedious to 
>>>>>>>>>>>>>> have to name
>>>>>>>>>>>>>> every instance of a transform that you might reuse in a 
>>>>>>>>>>>>>> pipeline, e.g. when
>>>>>>>>>>>>>> reapplying the same Map on different pcollections.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>>>>>>>> u...@beam.apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Are you talking about transform names? The main reason was
>>>>>>>>>>>>>>> because for runners that support updating pipelines in place, 
>>>>>>>>>>>>>>> the only way
>>>>>>>>>>>>>>> to do so safely is if the runner can perfectly identify which 
>>>>>>>>>>>>>>> transforms in
>>>>>>>>>>>>>>> the new graph match the ones in the old graph. There's no good 
>>>>>>>>>>>>>>> way to auto
>>>>>>>>>>>>>>> generate names that will stay stable across updates - even 
>>>>>>>>>>>>>>> small changes to
>>>>>>>>>>>>>>> the pipeline might change the order of nodes in the graph, 
>>>>>>>>>>>>>>> which could
>>>>>>>>>>>>>>> result in a corrupted update.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However, if you don't care about update, Beam can auto
>>>>>>>>>>>>>>> generate these names for you! When you call PCollection.apply 
>>>>>>>>>>>>>>> (if using
>>>>>>>>>>>>>>> BeamJava), simply omit the name parameter and Beam will auto 
>>>>>>>>>>>>>>> generate a
>>>>>>>>>>>>>>> unique name for you.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>>>>>>>> joey.t...@schrodinger.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped
>>>>>>>>>>>>>>>> up from accidentally have duplicate labels from using multiple 
>>>>>>>>>>>>>>>> of the same
>>>>>>>>>>>>>>>> transforms without labeling them. I figure this must be a 
>>>>>>>>>>>>>>>> common complaint,
>>>>>>>>>>>>>>>> so I was just curious, what the rationale behind this design 
>>>>>>>>>>>>>>>> was? My naive
>>>>>>>>>>>>>>>> thought off the top of my head is that it'd be more user 
>>>>>>>>>>>>>>>> friendly to just
>>>>>>>>>>>>>>>> auto increment duplicate transforms, but I figure I must be 
>>>>>>>>>>>>>>>> missing
>>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Joey
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>

Reply via email to