Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Robert Bradshaw Tue, 16 Oct 2018 01:28:40 -0700

Yes, we don't know how to parse and/or validate it.

On Tue, Oct 16, 2018 at 1:14 AM Lukasz Cwik <lc...@google.com> wrote:


> I see, is the issue that we currently are using a JSON representation for
> options when being serialized and when we get some unknown option, we don't
> know how to convert it into its JSON form?
>
> On Mon, Oct 15, 2018 at 2:41 PM Robert Bradshaw <rober...@google.com>
> wrote:
>
>> On Mon, Oct 15, 2018 at 11:30 PM Lukasz Cwik <lc...@google.com> wrote:
>> >
>> > On Mon, Oct 15, 2018 at 1:17 PM Robert Bradshaw <rober...@google.com>
>> wrote:
>> >>
>> >> On Mon, Oct 15, 2018 at 7:50 PM Lukasz Cwik <lc...@google.com> wrote:
>> >> >
>> >> > I agree with the sentiment for better error checking.
>> >> >
>> >> > We can try to make it such that the SDK can "fetch" the set of
>> options that the runner supports by making a call to the Job API. The API
>> could return a list of option names (descriptions for --help purposes and
>> also potentially the expected format) which would remove the worry around
>> "unknown" options. Yes I understand to be able to make the Job API call, we
>> may need to parse some options from the args parameters first and then
>> parse the unknown options after they are fetched.
>> >>
>> >> This is an interesting idea, but seems it could get quite complicated.
>> >> E.g. for delegating runners, one would first read the options to
>> >> determine which runner to fetch the options from, which would then
>> >> return a set of options that possibly depends on the values of some of
>> >> its options...
>> >>
>> >> > Alternatively, we can choose an explicit format upfront.
>> >> > To expand on the exact format for --runner_option=..., here are some
>> different ideas:
>> >> > 1) Specified multiple times, each one is an explicit flag
>> >> > --runner_option=--blah=bar --runner_option=--foo=baz1
>> --runner_option=--foo=baz2
>> >>
>> >> I'm -1 on this format. We should move away from the idea that options
>> >> == flags (as that doesn't compose well with other libraries that do
>> >> their own flags parsing). The ability to parse a set of flags into
>> >> options is just a convenience that an author may (or may not) choose
>> >> to use (e.g. when running pipelines a long-lived process like a
>> >> service or a notebook, the command line flags are almost certainly not
>> >> the right interface).
>> >>
>> >> > 2) specified multiple times, we drop the explicit flag
>> >> > --runner_option=blah=bar --runner_option=foo=baz1
>> --runner_option=foo=baz2
>> >>
>> >> This or (4) is my preference.
>> >>
>> >> > 3) we use a string which the runner can choose to interpret however
>> they want (JSON/XML shown below)
>> >> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
>> >> >
>> --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'
>> >>
>> >> This would make validation hard. Also, I think it makes sense for some
>> >> runner options to be "shared" (parallelism") by convention, so letting
>> >> it be a free-form string wouldn't allow different runners to inspect
>> >> different bits.
>> >>
>> >> We should consider if we should use urns for namespacing, and
>> >> assigning semantic meaning to strings, here.
>> >>
>> >> > 4) we use a string which must be a specific format such as JSON
>> (allows the SDK to do simple validation):
>> >> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
>> >>
>> >> I like this in that at least some validation can be performed, and
>> >> expectations of how to format richer types. On the other hand it gets
>> >> a bit verbose, given that most (I'd imagine) options will be simple.
>> >> As with normal options,
>> >>
>> >>     --option1=value1 --option2=value2
>> >>
>> >> is shorthand for {"option1": value1, "option2": value2}.
>> >>
>> > I lean to 4 the most. With 2, you run into issues of what does
>> --runner_option=foo=["a", "b"] --runner_option=foo=["c", "d"] mean?
>> > Is it an error or list of lists or concatenated. Similar issues for map
>> types represented via JSON object {...}
>>
>> We can err to be on the safe side unless/until an argument can be made
>> that merging is more natural. I just think this will be excessively
>> verbose to use.
>>
>> >> > I would strongly suggest that we go with the "fetch" approach, since
>> this makes the set of options discoverable and helps users find errors much
>> earlier in their pipeline.
>> >>
>> >> This seems like an advanced feature that SDKs may want to support, but
>> >> I wouldn't want to require this complexity for bootstrapping an SDK.
>> >>
>> > SDKs that are starting off wouldn't need to "fetch" options, they could
>> choose to not support runner options or they could choose to pass all
>> options through to the runner blindly. Fetching the options only provides
>> the SDK the ability to provide error checking upfront and useful error/help
>> messages.
>>
>> But how to even pass all options through blindly is exactly the
>> difficulty we're running into here.
>>
>> >> Regarding always keeping runner options separate, +1, though I'm not
>> >> sure the line is always clear.
>> >>
>> >>
>> >> > On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw <rober...@google.com>
>> wrote:
>> >> >>
>> >> >> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels <m...@apache.org>
>> wrote:
>> >> >> >
>> >> >> > I agree that the current approach breaks the pipeline options
>> contract
>> >> >> > because "unknown" options get parsed in the same way as options
>> which
>> >> >> > have been defined by the user.
>> >> >>
>> >> >> FWIW, I think we're already breaking this "contract." Unknown
>> options
>> >> >> are silently ignored; with this change we just change how we record
>> >> >> them. It still feels a bit hacky though.
>> >> >>
>> >> >> > I'm not sure the `experiments` flag works for us. AFAIK it only
>> allows
>> >> >> > true/false flags. We want to pass all types of pipeline options
>> to the
>> >> >> > Runner.
>> >> >>
>> >> >> Experiments is an arbitrary set of strings, which can be of the form
>> >> >> "param=value" if that's useful. (Dataflow does this.) There is,
>> again,
>> >> >> no namespacing on the param names, but we could user urns or impose
>> >> >> some other structure here.
>> >> >>
>> >> >> > How to solve this?
>> >> >> >
>> >> >> > 1) Add all options of all Runners to each SDK
>> >> >> > We added some of the FlinkRunner options to the Python SDK but
>> realized
>> >> >> > syncing is rather cumbersome in the long term. However, we want
>> the most
>> >> >> > important options to be validated on the client side.
>> >> >>
>> >> >> I don't think this is sustainable in the long run. However, thinking
>> >> >> about this, in the worse case validation happens after construction
>> >> >> but before execution (as with much of our other validation) so it
>> >> >> isn't that bad.
>> >> >>
>> >> >> > 2) Pass "unknown" options via a separate list in the Proto which
>> can
>> >> >> > only be accessed internally by the Runners. This still allows
>> passing
>> >> >> > arbitrary options but we wouldn't leak unknown options and
>> display them
>> >> >> > as top-level options.
>> >> >>
>> >> >> I think there needs to be a way for the user to communicate values
>> >> >> directly to the runner regardless of the SDK. My preference would be
>> >> >> to make this explicit, e.g. (repeated) --runner_option=..., rather
>> >> >> than scooping up all unknown flags at command line parsing time.
>> >> >> Perhaps an SDK that is aware of some runners could choose to lift
>> >> >> these as top-level options, but still pass them as runner options.
>> >> >>
>> >> >> > On 13.10.18 02:34, Charles Chen wrote:
>> >> >> > > The current release branch
>> >> >> > > (https://github.com/apache/beam/commits/release-2.8.0) was cut
>> after the
>> >> >> > > revert went in.  Sent out
>> https://github.com/apache/beam/pull/6683 as a
>> >> >> > > revert of the revert.  Regarding your comment above, I can help
>> out with
>> >> >> > > the design / PR reviews for common Python code as you suggest.
>> >> >> > >
>> >> >> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise <t...@apache.org
>> >> >> > > <mailto:t...@apache.org>> wrote:
>> >> >> > >
>> >> >> > >     Thanks, will tag you and looking forward to feedback so we
>> can
>> >> >> > >     ensure that changes work for everyone.
>> >> >> > >
>> >> >> > >     Looking at the PR, I see agreement from Max to revert the
>> change on
>> >> >> > >     the release branch, but not in master. Would you mind to
>> restore it
>> >> >> > >     in master?
>> >> >> > >
>> >> >> > >     Thanks
>> >> >> > >
>> >> >> > >     On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay <
>> al...@google.com
>> >> >> > >     <mailto:al...@google.com>> wrote:
>> >> >> > >
>> >> >> > >
>> >> >> > >
>> >> >> > >         On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen <
>> c...@google.com
>> >> >> > >         <mailto:c...@google.com>> wrote:
>> >> >> > >
>> >> >> > >             What I mean is that a user may find that it works
>> for them
>> >> >> > >             to pass "--myarg blah" and access it as
>> "options.myarg"
>> >> >> > >             without explicitly defining a "my_arg" flag due to
>> the added
>> >> >> > >             logic.  This is not the intended behavior and we
>> may want to
>> >> >> > >             change this implementation detail in the future.
>> However,
>> >> >> > >             having this logic in a released version makes it
>> hard to
>> >> >> > >             change this behavior since users may erroneously
>> depend on
>> >> >> > >             this undocumented behavior.  Instead, we should
>> namespace /
>> >> >> > >             scope this so that it is obvious that this is meant
>> for
>> >> >> > >             runner (and not Beam user) consumption.
>> >> >> > >
>> >> >> > >             On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise
>> >> >> > >             <t...@apache.org <mailto:t...@apache.org>> wrote:
>> >> >> > >
>> >> >> > >                 Can you please elaborate more what practical
>> problems
>> >> >> > >                 this introduces for users?
>> >> >> > >
>> >> >> > >                 I can see that this change allows a user to
>> specify a
>> >> >> > >                 runner specific option, which in the future may
>> change
>> >> >> > >                 because we decide to scope differently. If this
>> only
>> >> >> > >                 affects users of the portable Flink runner
>> (like us),
>> >> >> > >                 then no need to revert, because at this early
>> stage we
>> >> >> > >                 prefer something that works over being blocked.
>> >> >> > >
>> >> >> > >                 It would also be really great if some of the
>> core Python
>> >> >> > >                 SDK developers could help out with the design
>> aspects
>> >> >> > >                 and PR reviews of changes that affect common
>> Python
>> >> >> > >                 code. Anyone who specifically wants to be
>> tagged on
>> >> >> > >                 relevant JIRAs and PRs?
>> >> >> > >
>> >> >> > >
>> >> >> > >         I would be happy to be tagged, and I can also help with
>> >> >> > >         including other relevant folks whenever possible. In
>> general I
>> >> >> > >         think Robert, Charles, myself are good candidates.
>> >> >> > >
>> >> >> > >
>> >> >> > >                 Thanks
>> >> >> > >
>> >> >> > >
>> >> >> > >                 On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay
>> >> >> > >                 <al...@google.com <mailto:al...@google.com>>
>> wrote:
>> >> >> > >
>> >> >> > >
>> >> >> > >
>> >> >> > >                     On Fri, Oct 12, 2018 at 10:11 AM, Charles
>> Chen
>> >> >> > >                     <c...@google.com <mailto:c...@google.com>>
>> wrote:
>> >> >> > >
>> >> >> > >                         For context, I made comments on
>> >> >> > >
>> https://github.com/apache/beam/pull/6600 noting
>> >> >> > >                         that the changes being made were not
>> good for
>> >> >> > >                         Beam backwards-compatibility.  The
>> change as is
>> >> >> > >                         allows users to use pipeline options
>> without
>> >> >> > >                         explicitly defining them, which is not
>> the type
>> >> >> > >                         of usage we would like to encourage
>> since we
>> >> >> > >                         prefer to be explicit whenever
>> possible.  If
>> >> >> > >                         users write pipelines with this sort of
>> pattern,
>> >> >> > >                         they will potentially encounter pain
>> when
>> >> >> > >                         upgrading to a later version since this
>> is an
>> >> >> > >                         implementation detail and not an
>> officially
>> >> >> > >                         supported pattern.  I agree with the
>> comments
>> >> >> > >                         above that this is ultimately a scoping
>> issue.
>> >> >> > >                         I would not have a problem with these
>> changes if
>> >> >> > >                         they were explicitly scoped under
>> either a
>> >> >> > >                         runner or unparsed options namespace.
>> >> >> > >
>> >> >> > >                         As a second note, since the 2.8.0
>> release is
>> >> >> > >                         being cut right now, because of these
>> >> >> > >                         backwards-compatibility concerns, I
>> would
>> >> >> > >                         suggest reverting these changes, at
>> least until
>> >> >> > >                         2.8.0 is cut, so we can have a
>> discussion here
>> >> >> > >                         before committing to and releasing any
>> API-level
>> >> >> > >                         changes.
>> >> >> > >
>> >> >> > >
>> >> >> > >                     +1 I would like to revert the changes in
>> order not
>> >> >> > >                     rush this into the release. Once this
>> discussion
>> >> >> > >                     results in an agreement changes can be
>> brought back.
>> >> >> > >
>> >> >> > >
>> >> >> > >                         On Fri, Oct 12, 2018 at 9:26 AM Henning
>> Rohde
>> >> >> > >                         <hero...@google.com <mailto:
>> hero...@google.com>>
>> >> >> > >                         wrote:
>> >> >> > >
>> >> >> > >                             Agree that pipeline options lack
>> some
>> >> >> > >                             mechanism for scoping. It is also
>> not always
>> >> >> > >                             possible distinguish options meant
>> to be
>> >> >> > >                             consumed at pipeline construction
>> time, by
>> >> >> > >                             the runner, by the SDK harness, by
>> the user
>> >> >> > >                             code or any combination -- and this
>> causes
>> >> >> > >                             confusion every now and then.
>> >> >> > >
>> >> >> > >                             For Dataflow, we have been using
>> >> >> > >                             "experiments" for arbitrary
>> runner-specific
>> >> >> > >                             options. It's simply a string list
>> pipeline
>> >> >> > >                             option that all SDKs support and,
>> for Go at
>> >> >> > >                             least, is sent to portable runners.
>> Flink
>> >> >> > >                             can do the same in the short term
>> to move
>> >> >> > >                             forward.
>> >> >> > >
>> >> >> > >                             Henning
>> >> >> > >
>> >> >> > >
>> >> >> > >                             On Fri, Oct 12, 2018 at 8:50 AM
>> Thomas Weise
>> >> >> > >                             <t...@apache.org <mailto:
>> t...@apache.org>> wrote:
>> >> >> > >
>> >> >> > >                                 [moving to the list]
>> >> >> > >
>> >> >> > >                                 The requirement driving this
>> part of the
>> >> >> > >                                 change was to allow a user to
>> specify
>> >> >> > >                                 pipeline options that a runner
>> supports
>> >> >> > >                                 without having to declare those
>> in each
>> >> >> > >                                 language SDK.
>> >> >> > >
>> >> >> > >                                 In the specific scenario, we
>> have
>> >> >> > >                                 options that the Flink runner
>> supports
>> >> >> > >                                 (and can validate), that are not
>> >> >> > >                                 enumerated in the Python SDK.
>> >> >> > >
>> >> >> > >                                 I think we have a bigger
>> problem scoping
>> >> >> > >                                 pipeline options. For example,
>> the
>> >> >> > >                                 runner options are dumped into
>> the SDK
>> >> >> > >                                 worker. There is also a
>> possibility of
>> >> >> > >                                 name collisions. So I think
>> this would
>> >> >> > >                                 benefit from broader feedback.
>> >> >> > >
>> >> >> > >                                 Thanks,
>> >> >> > >                                 Thomas
>> >> >> > >
>> >> >> > >
>> >> >> > >                                 ---------- Forwarded message
>> ---------
>> >> >> > >                                 From: *Charles Chen*
>> >> >> > >                                 <notificati...@github.com
>> >> >> > >                                 <mailto:
>> notificati...@github.com>>
>> >> >> > >                                 Date: Fri, Oct 12, 2018 at 8:36
>> AM
>> >> >> > >                                 Subject: Re: [apache/beam]
>> [BEAM-5442]
>> >> >> > >                                 Store duplicate unknown options
>> in a
>> >> >> > >                                 list argument (#6600)
>> >> >> > >                                 To: apache/beam <
>> b...@noreply.github.com
>> >> >> > >                                 <mailto:b...@noreply.github.com
>> >>
>> >> >> > >                                 Cc: Thomas Weise <
>> thomas.we...@gmail.com
>> >> >> > >                                 <mailto:thomas.we...@gmail.com
>> >>,
>> >> >> > >                                 Mention <
>> ment...@noreply.github.com
>> >> >> > >                                 <mailto:
>> ment...@noreply.github.com>>
>> >> >> > >
>> >> >> > >
>> >> >> > >                                 CC: @tweise <
>> https://github.com/tweise>
>> >> >> > >
>> >> >> > >                                 —
>> >> >> > >                                 You are receiving this because
>> you were
>> >> >> > >                                 mentioned.
>> >> >> > >                                 Reply to this email directly,
>> view it on
>> >> >> > >                                 GitHub
>> >> >> > >                                 <
>> https://github.com/apache/beam/pull/6600#issuecomment-429367754>,
>> >> >> > >                                 or mute the thread
>> >> >> > >                                 <
>> https://github.com/notifications/unsubscribe-auth/AAQGDwwt15R85eq9pySUisyxq2HYz-Vyks5ukLcLgaJpZM4XMo-T
>> >.
>> >> >> > >
>> >> >> > >
>> >> >> > >
>>
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Reply via email to