Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Lukasz Cwik Mon, 15 Oct 2018 16:14:51 -0700

I see. Is the issue that options are currently serialized using a JSON
representation, and when we get an unknown option we don't know how to
convert it into its JSON form?
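
To make the question concrete, here is a minimal Python sketch of the
distinction, using only argparse and json. It is not the Beam SDK's actual
code: the declared --parallelism option, the parse_options/to_json helpers
and the "runner_options" JSON key are assumptions for illustration, and
--runner_option is the repeated flag proposed further down in this thread.
A declared option has a known type and so an unambiguous JSON form; anything
passed through --runner_option has no declared type, so the sketch keeps it
as a list of raw strings.

```python
import argparse
import json


def parse_options(argv):
    """Split argv into declared options and leftover unknown flags."""
    parser = argparse.ArgumentParser()
    # A declared option: its type is known, so its JSON form is unambiguous.
    parser.add_argument("--parallelism", type=int, default=1)
    # Hypothetical explicit pass-through flag; it may be repeated, and each
    # occurrence is appended to a list without being interpreted.
    parser.add_argument("--runner_option", action="append", default=[])
    # parse_known_args returns the flags it did not recognize separately
    # instead of failing on them.
    known, unknown = parser.parse_known_args(argv)
    return known, unknown


def to_json(known):
    """Serialize options; runner options stay as uninterpreted strings."""
    return json.dumps(
        {
            "parallelism": known.parallelism,
            "runner_options": known.runner_option,
        },
        sort_keys=True,
    )


known, unknown = parse_options(
    [
        "--parallelism=4",
        "--runner_option=foo=baz1",
        "--runner_option=foo=baz2",
        "--blah=bar",
    ]
)
serialized = to_json(known)
print(serialized)
print("unrecognized flags:", unknown)
```

In this sketch the two --runner_option values end up side by side in one
list, so a repeated key is preserved rather than overwritten, while
--blah=bar is reported as unrecognized instead of being silently recorded.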


On Mon, Oct 15, 2018 at 2:41 PM Robert Bradshaw <[email protected]> wrote:

> On Mon, Oct 15, 2018 at 11:30 PM Lukasz Cwik <[email protected]> wrote:
> >
> > On Mon, Oct 15, 2018 at 1:17 PM Robert Bradshaw <[email protected]>
> wrote:
> >>
> >> On Mon, Oct 15, 2018 at 7:50 PM Lukasz Cwik <[email protected]> wrote:
> >> >
> >> > I agree with the sentiment for better error checking.
> >> >
> >> > We can try to make it such that the SDK can "fetch" the set of
> options that the runner supports by making a call to the Job API. The API
> could return a list of option names (descriptions for --help purposes and
> also potentially the expected format) which would remove the worry around
> "unknown" options. Yes I understand to be able to make the Job API call, we
> may need to parse some options from the args parameters first and then
> parse the unknown options after they are fetched.
> >>
> >> This is an interesting idea, but seems it could get quite complicated.
> >> E.g. for delegating runners, one would first read the options to
> >> determine which runner to fetch the options from, which would then
> >> return a set of options that possibly depends on the values of some of
> >> its options...
> >>
> >> > Alternatively, we can choose an explicit format upfront.
> >> > To expand on the exact format for --runner_option=..., here are some
> different ideas:
> >> > 1) Specified multiple times, each one is an explicit flag
> >> > --runner_option=--blah=bar --runner_option=--foo=baz1
> --runner_option=--foo=baz2
> >>
> >> I'm -1 on this format. We should move away from the idea that options
> >> == flags (as that doesn't compose well with other libraries that do
> >> their own flags parsing). The ability to parse a set of flags into
> >> options is just a convenience that an author may (or may not) choose
> >> to use (e.g. when running pipelines in a long-lived process like a
> >> service or a notebook, the command line flags are almost certainly not
> >> the right interface).
> >>
> >> > 2) specified multiple times, we drop the explicit flag
> >> > --runner_option=blah=bar --runner_option=foo=baz1
> --runner_option=foo=baz2
> >>
> >> This or (4) is my preference.
> >>
> >> > 3) we use a string which the runner can choose to interpret however
> they want (JSON/XML shown below)
> >> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
> >> >
> --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'
> >>
> >> This would make validation hard. Also, I think it makes sense for some
> >> runner options to be "shared" ("parallelism") by convention, so letting
> >> it be a free-form string wouldn't allow different runners to inspect
> >> different bits.
> >>
> >> We should consider if we should use urns for namespacing, and
> >> assigning semantic meaning to strings, here.
> >>
> >> > 4) we use a string which must be a specific format such as JSON
> (allows the SDK to do simple validation):
> >> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
> >>
> >> I like this in that at least some validation can be performed, and it
> >> sets expectations for how to format richer types. On the other hand it gets
> >> a bit verbose, given that most (I'd imagine) options will be simple.
> >> As with normal options,
> >>
> >>     --option1=value1 --option2=value2
> >>
> >> is shorthand for {"option1": value1, "option2": value2}.
> >>
> > I lean to 4 the most. With 2, you run into issues of what does
> --runner_option=foo=["a", "b"] --runner_option=foo=["c", "d"] mean?
> > Is it an error, a list of lists, or a concatenation? Similar issues arise for map
> types represented via a JSON object {...}
>
> We can err to be on the safe side unless/until an argument can be made
> that merging is more natural. I just think this will be excessively
> verbose to use.
>
> >> > I would strongly suggest that we go with the "fetch" approach, since
> this makes the set of options discoverable and helps users find errors much
> earlier in their pipeline.
> >>
> >> This seems like an advanced feature that SDKs may want to support, but
> >> I wouldn't want to require this complexity for bootstrapping an SDK.
> >>
> > SDKs that are starting off wouldn't need to "fetch" options, they could
> choose to not support runner options or they could choose to pass all
> options through to the runner blindly. Fetching the options only provides
> the SDK the ability to provide error checking upfront and useful error/help
> messages.
>
> But how to even pass all options through blindly is exactly the
> difficulty we're running into here.
>
> >> Regarding always keeping runner options separate, +1, though I'm not
> >> sure the line is always clear.
> >>
> >>
> >> > On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw <[email protected]>
> wrote:
> >> >>
> >> >> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels <[email protected]>
> wrote:
> >> >> >
> >> >> > I agree that the current approach breaks the pipeline options
> contract
> >> >> > because "unknown" options get parsed in the same way as options
> which
> >> >> > have been defined by the user.
> >> >>
> >> >> FWIW, I think we're already breaking this "contract." Unknown options
> >> >> are silently ignored; with this change we just change how we record
> >> >> them. It still feels a bit hacky though.
> >> >>
> >> >> > I'm not sure the `experiments` flag works for us. AFAIK it only
> allows
> >> >> > true/false flags. We want to pass all types of pipeline options to
> the
> >> >> > Runner.
> >> >>
> >> >> Experiments is an arbitrary set of strings, which can be of the form
> >> >> "param=value" if that's useful. (Dataflow does this.) There is,
> again,
> >> >> no namespacing on the param names, but we could use urns or impose
> >> >> some other structure here.
> >> >>
> >> >> > How to solve this?
> >> >> >
> >> >> > 1) Add all options of all Runners to each SDK
> >> >> > We added some of the FlinkRunner options to the Python SDK but
> realized
> >> >> > syncing is rather cumbersome in the long term. However, we want
> the most
> >> >> > important options to be validated on the client side.
> >> >>
> >> >> I don't think this is sustainable in the long run. However, thinking
> >> >> about this, in the worst case validation happens after construction
> >> >> but before execution (as with much of our other validation) so it
> >> >> isn't that bad.
> >> >>
> >> >> > 2) Pass "unknown" options via a separate list in the Proto which
> can
> >> >> > only be accessed internally by the Runners. This still allows
> passing
> >> >> > arbitrary options but we wouldn't leak unknown options and display
> them
> >> >> > as top-level options.
> >> >>
> >> >> I think there needs to be a way for the user to communicate values
> >> >> directly to the runner regardless of the SDK. My preference would be
> >> >> to make this explicit, e.g. (repeated) --runner_option=..., rather
> >> >> than scooping up all unknown flags at command line parsing time.
> >> >> Perhaps an SDK that is aware of some runners could choose to lift
> >> >> these as top-level options, but still pass them as runner options.
> >> >>
> >> >> > On 13.10.18 02:34, Charles Chen wrote:
> >> >> > > The current release branch
> >> >> > > (https://github.com/apache/beam/commits/release-2.8.0) was cut
> after the
> >> >> > > revert went in.  Sent out
> https://github.com/apache/beam/pull/6683 as a
> >> >> > > revert of the revert.  Regarding your comment above, I can help
> out with
> >> >> > > the design / PR reviews for common Python code as you suggest.
> >> >> > >
> >> >> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise <[email protected]
> >> >> > > <mailto:[email protected]>> wrote:
> >> >> > >
> >> >> > >     Thanks, will tag you and looking forward to feedback so we
> can
> >> >> > >     ensure that changes work for everyone.
> >> >> > >
> >> >> > >     Looking at the PR, I see agreement from Max to revert the
> change on
> >> >> > >     the release branch, but not in master. Would you mind
> restoring it
> >> >> > >     in master?
> >> >> > >
> >> >> > >     Thanks
> >> >> > >
> >> >> > >     On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay <
> [email protected]
> >> >> > >     <mailto:[email protected]>> wrote:
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > >         On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen <
> [email protected]
> >> >> > >         <mailto:[email protected]>> wrote:
> >> >> > >
> >> >> > >             What I mean is that a user may find that it works
> for them
> >> >> > >             to pass "--myarg blah" and access it as
> "options.myarg"
> >> >> > >             without explicitly defining a "myarg" flag due to
> the added
> >> >> > >             logic.  This is not the intended behavior and we may
> want to
> >> >> > >             change this implementation detail in the future.
> However,
> >> >> > >             having this logic in a released version makes it
> hard to
> >> >> > >             change this behavior since users may erroneously
> depend on
> >> >> > >             this undocumented behavior.  Instead, we should
> namespace /
> >> >> > >             scope this so that it is obvious that this is meant
> for
> >> >> > >             runner (and not Beam user) consumption.
> >> >> > >
> >> >> > >             On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise
> >> >> > >             <[email protected] <mailto:[email protected]>> wrote:
> >> >> > >
> >> >> > >                 Can you please elaborate more what practical
> problems
> >> >> > >                 this introduces for users?
> >> >> > >
> >> >> > >                 I can see that this change allows a user to
> specify a
> >> >> > >                 runner specific option, which in the future may
> change
> >> >> > >                 because we decide to scope differently. If this
> only
> >> >> > >                 affects users of the portable Flink runner (like
> us),
> >> >> > >                 then no need to revert, because at this early
> stage we
> >> >> > >                 prefer something that works over being blocked.
> >> >> > >
> >> >> > >                 It would also be really great if some of the
> core Python
> >> >> > >                 SDK developers could help out with the design
> aspects
> >> >> > >                 and PR reviews of changes that affect common
> Python
> >> >> > >                 code. Anyone who specifically wants to be tagged
> on
> >> >> > >                 relevant JIRAs and PRs?
> >> >> > >
> >> >> > >
> >> >> > >         I would be happy to be tagged, and I can also help with
> >> >> > >         including other relevant folks whenever possible. In
> general I
> >> >> > >         think Robert, Charles, and myself are good candidates.
> >> >> > >
> >> >> > >
> >> >> > >                 Thanks
> >> >> > >
> >> >> > >
> >> >> > >                 On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay
> >> >> > >                 <[email protected] <mailto:[email protected]>>
> wrote:
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > >                     On Fri, Oct 12, 2018 at 10:11 AM, Charles
> Chen
> >> >> > >                     <[email protected] <mailto:[email protected]>>
> wrote:
> >> >> > >
> >> >> > >                         For context, I made comments on
> >> >> > >                         https://github.com/apache/beam/pull/6600
> noting
> >> >> > >                         that the changes being made were not
> good for
> >> >> > >                         Beam backwards-compatibility.  The
> change as is
> >> >> > >                         allows users to use pipeline options
> without
> >> >> > >                         explicitly defining them, which is not
> the type
> >> >> > >                         of usage we would like to encourage
> since we
> >> >> > >                         prefer to be explicit whenever
> possible.  If
> >> >> > >                         users write pipelines with this sort of
> pattern,
> >> >> > >                         they will potentially encounter pain when
> >> >> > >                         upgrading to a later version since this
> is an
> >> >> > >                         implementation detail and not an
> officially
> >> >> > >                         supported pattern.  I agree with the
> comments
> >> >> > >                         above that this is ultimately a scoping
> issue.
> >> >> > >                         I would not have a problem with these
> changes if
> >> >> > >                         they were explicitly scoped under either
> a
> >> >> > >                         runner or unparsed options namespace.
> >> >> > >
> >> >> > >                         As a second note, since the 2.8.0
> release is
> >> >> > >                         being cut right now, because of these
> >> >> > >                         backwards-compatibility concerns, I would
> >> >> > >                         suggest reverting these changes, at
> least until
> >> >> > >                         2.8.0 is cut, so we can have a
> discussion here
> >> >> > >                         before committing to and releasing any
> API-level
> >> >> > >                         changes.
> >> >> > >
> >> >> > >
> >> >> > >                     +1 I would like to revert the changes in
> order not
> >> >> > >                     rush this into the release. Once this
> discussion
> >> >> > >                     results in an agreement changes can be
> brought back.
> >> >> > >
> >> >> > >
> >> >> > >                         On Fri, Oct 12, 2018 at 9:26 AM Henning
> Rohde
> >> >> > >                         <[email protected] <mailto:
> [email protected]>>
> >> >> > >                         wrote:
> >> >> > >
> >> >> > >                             Agree that pipeline options lack some
> >> >> > >                             mechanism for scoping. It is also
> not always
> >> >> > >                             possible to distinguish options meant
> to be
> >> >> > >                             consumed at pipeline construction
> time, by
> >> >> > >                             the runner, by the SDK harness, by
> the user
> >> >> > >                             code or any combination -- and this
> causes
> >> >> > >                             confusion every now and then.
> >> >> > >
> >> >> > >                             For Dataflow, we have been using
> >> >> > >                             "experiments" for arbitrary
> runner-specific
> >> >> > >                             options. It's simply a string list
> pipeline
> >> >> > >                             option that all SDKs support and,
> for Go at
> >> >> > >                             least, is sent to portable runners.
> Flink
> >> >> > >                             can do the same in the short term to
> move
> >> >> > >                             forward.
> >> >> > >
> >> >> > >                             Henning
> >> >> > >
> >> >> > >
> >> >> > >                             On Fri, Oct 12, 2018 at 8:50 AM
> Thomas Weise
> >> >> > >                             <[email protected] <mailto:
> [email protected]>> wrote:
> >> >> > >
> >> >> > >                                 [moving to the list]
> >> >> > >
> >> >> > >                                 The requirement driving this
> part of the
> >> >> > >                                 change was to allow a user to
> specify
> >> >> > >                                 pipeline options that a runner
> supports
> >> >> > >                                 without having to declare those
> in each
> >> >> > >                                 language SDK.
> >> >> > >
> >> >> > >                                 In the specific scenario, we have
> >> >> > >                                 options that the Flink runner
> supports
> >> >> > >                                 (and can validate), that are not
> >> >> > >                                 enumerated in the Python SDK.
> >> >> > >
> >> >> > >                                 I think we have a bigger problem
> scoping
> >> >> > >                                 pipeline options. For example,
> the
> >> >> > >                                 runner options are dumped into
> the SDK
> >> >> > >                                 worker. There is also a
> possibility of
> >> >> > >                                 name collisions. So I think this
> would
> >> >> > >                                 benefit from broader feedback.
> >> >> > >
> >> >> > >                                 Thanks,
> >> >> > >                                 Thomas
> >> >> > >
> >> >> > >
> >> >> > >                                 ---------- Forwarded message
> ---------
> >> >> > >                                 From: *Charles Chen*
> >> >> > >                                 <[email protected]
> >> >> > >                                 <mailto:[email protected]
> >>
> >> >> > >                                 Date: Fri, Oct 12, 2018 at 8:36
> AM
> >> >> > >                                 Subject: Re: [apache/beam]
> [BEAM-5442]
> >> >> > >                                 Store duplicate unknown options
> in a
> >> >> > >                                 list argument (#6600)
> >> >> > >                                 To: apache/beam <
> [email protected]
> >> >> > >                                 <mailto:[email protected]
> >>
> >> >> > >                                 Cc: Thomas Weise <
> [email protected]
> >> >> > >                                 <mailto:[email protected]
> >>,
> >> >> > >                                 Mention <
> [email protected]
> >> >> > >                                 <mailto:
> [email protected]>>
> >> >> > >
> >> >> > >
> >> >> > >                                 CC: @tweise <
> https://github.com/tweise>
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > >
>
