Yes, we don't know how to parse and/or validate it.
On Tue, Oct 16, 2018 at 1:14 AM Lukasz Cwik <lc...@google.com> wrote:
I see. Is the issue that we currently use a JSON representation for
options when they are serialized, and when we get an unknown option
we don't know how to convert it into its JSON form?
On Mon, Oct 15, 2018 at 2:41 PM Robert Bradshaw <rober...@google.com> wrote:
On Mon, Oct 15, 2018 at 11:30 PM Lukasz Cwik <lc...@google.com> wrote:
>
> On Mon, Oct 15, 2018 at 1:17 PM Robert Bradshaw <rober...@google.com> wrote:
>>
>> On Mon, Oct 15, 2018 at 7:50 PM Lukasz Cwik <lc...@google.com> wrote:
>> >
>> > I agree with the sentiment for better error checking.
>> >
>> > We can try to make it such that the SDK can "fetch" the set of
>> > options that the runner supports by making a call to the Job API.
>> > The API could return a list of option names (descriptions for
>> > --help purposes and also potentially the expected format) which
>> > would remove the worry around "unknown" options. Yes, I understand
>> > that to be able to make the Job API call, we may need to parse some
>> > options from the args parameters first and then parse the unknown
>> > options after they are fetched.
>>
>> This is an interesting idea, but seems it could get quite complicated.
>> E.g. for delegating runners, one would first read the options to
>> determine which runner to fetch the options from, which would then
>> return a set of options that possibly depends on the values of some of
>> its options...
>>
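A minimal sketch of the two-phase "fetch" parsing described above, in Python with argparse. The function fetch_runner_option_names and its catalog are hypothetical stand-ins for a Job API call that the thread only proposes; nothing here is an existing Beam API.

```python
import argparse


def fetch_runner_option_names(runner):
    """Hypothetical stand-in for a Job API call returning supported option names."""
    # Assumption: the catalog contents are made up for illustration.
    catalog = {"FlinkRunner": ["parallelism"], "DirectRunner": []}
    return catalog[runner]


def parse_with_fetched_options(argv):
    # Phase 1: read only what is needed to know which runner to ask.
    bootstrap = argparse.ArgumentParser(add_help=False)
    bootstrap.add_argument("--runner", required=True)
    first, remaining = bootstrap.parse_known_args(argv)

    # Phase 2: build a parser from the fetched names and parse the rest.
    parser = argparse.ArgumentParser(add_help=False)
    for name in fetch_runner_option_names(first.runner):
        parser.add_argument("--" + name)
    second, unknown = parser.parse_known_args(remaining)
    if unknown:
        # With the option set known, unknown flags can fail early.
        raise ValueError("options not supported by %s: %s" % (first.runner, unknown))
    return first.runner, vars(second)


runner, runner_options = parse_with_fetched_options(
    ["--runner=FlinkRunner", "--parallelism=4"])
```

The sketch does not address the delegating-runner case raised above, where the set of fetched options may itself depend on option values.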
>> > Alternatively, we can choose an explicit format upfront.
>> > To expand on the exact format for --runner_option=..., here are some different ideas:
>> > 1) Specified multiple times, each one is an explicit flag
>> > --runner_option=--blah=bar --runner_option=--foo=baz1 --runner_option=--foo=baz2
>>
>> I'm -1 on this format. We should move away from the idea that options
>> == flags (as that doesn't compose well with other libraries that do
>> their own flags parsing). The ability to parse a set of flags into
>> options is just a convenience that an author may (or may not) choose
>> to use (e.g. when running pipelines in a long-lived process like a
>> service or a notebook, the command line flags are almost certainly not
>> the right interface).
>>
>> > 2) specified multiple times, we drop the explicit flag
>> > --runner_option=blah=bar --runner_option=foo=baz1 --runner_option=foo=baz2
>>
>> This or (4) is my preference.
>>
>> > 3) we use a string which the runner can choose to interpret however they want (JSON/XML shown below)
>> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
>> > --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'
>>
>> This would make validation hard. Also, I think it makes sense for some
>> runner options to be "shared" ("parallelism") by convention, so letting
>> it be a free-form string wouldn't allow different runners to inspect
>> different bits.
>>
>> We should consider if we should use urns for namespacing, and
>> assigning semantic meaning to strings, here.
>>
>> > 4) we use a string which must be a specific format such as JSON (allows the SDK to do simple validation):
>> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
>>
>> I like this in that at least some validation can be performed, and it
>> sets expectations of how to format richer types. On the other hand it
>> gets a bit verbose, given that most (I'd imagine) options will be simple.
>> As with normal options,
>>
>> --option1=value1 --option2=value2
>>
>> is shorthand for {"option1": value1, "option2": value2}.
>>
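To make format (4) concrete, here is a minimal Python sketch of the "simple validation" an SDK could do on a JSON-valued --runner_option. The helper name parse_runner_option_json is an assumption for illustration and is not part of any Beam SDK.

```python
import argparse
import json


def parse_runner_option_json(argv):
    """Parse one --runner_option flag whose value must be a JSON object."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--runner_option", default="{}")
    known, _ = parser.parse_known_args(argv)
    try:
        value = json.loads(known.runner_option)
    except json.JSONDecodeError as exc:
        raise ValueError("--runner_option is not valid JSON: %s" % exc)
    if not isinstance(value, dict):
        # The SDK can check the shape without knowing any option names.
        raise ValueError("--runner_option must be a JSON object")
    return value


options = parse_runner_option_json(
    ['--runner_option={"blah": "bar", "foo": ["baz1", "baz2"]}'])
```

The SDK only checks that the value is well-formed JSON with an object at the top level; whether "blah" or "foo" mean anything is still left to the runner.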
> I lean to 4 the most. With 2, you run into issues of what does
> --runner_option=foo=["a", "b"] --runner_option=foo=["c", "d"] mean?
> Is it an error, a list of lists, or concatenated? Similar issues
> arise for map types represented via JSON object {...}

We can err on the safe side unless/until an argument can be made
that merging is more natural. I just think this will be excessively
verbose to use.
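
One reading of erring on the safe side for format (2) is to reject a key that appears more than once rather than guess at merge semantics. A minimal Python sketch under that assumption (merge_runner_options is a hypothetical helper name):

```python
def merge_runner_options(pairs):
    """Merge 'key=value' strings into a dict, rejecting repeated keys."""
    merged = {}
    for pair in pairs:
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError("expected key=value, got %r" % pair)
        if key in merged:
            # No merging: a repeated key is an error until semantics are agreed.
            raise ValueError("runner option %r given more than once" % key)
        merged[key] = value
    return merged


merged = merge_runner_options(["blah=bar", "foo=baz1"])
```

Under this rule the earlier example with foo=baz1 and foo=baz2 would be rejected, which is stricter than what format (2) was proposed to allow.
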
>> > I would strongly suggest that we go with the "fetch" approach, since
>> > this makes the set of options discoverable and helps users find
>> > errors much earlier in their pipeline.
>>
>> This seems like an advanced feature that SDKs may want to support, but
>> I wouldn't want to require this complexity for bootstrapping an SDK.
>>
> SDKs that are starting off wouldn't need to "fetch" options; they
> could choose to not support runner options or they could choose to
> pass all options through to the runner blindly. Fetching the options
> only provides the SDK the ability to provide error checking upfront
> and useful error/help messages.

But how to even pass all options through blindly is exactly the
difficulty we're running into here.

>> Regarding always keeping runner options separate, +1, though I'm not
>> sure the line is always clear.
>>
>>
>> > On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw <rober...@google.com> wrote:
>> >>
>> >> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels <m...@apache.org> wrote:
>> >> >
>> >> > I agree that the current approach breaks the pipeline options contract
>> >> > because "unknown" options get parsed in the same way as options which
>> >> > have been defined by the user.
>> >>
>> >> FWIW, I think we're already breaking this "contract." Unknown options
>> >> are silently ignored; with this change we just change how we record
>> >> them. It still feels a bit hacky though.
>> >>
>> >> > I'm not sure the `experiments` flag works for us. AFAIK it only allows
>> >> > true/false flags. We want to pass all types of pipeline options to the
>> >> > Runner.
>> >>
>> >> Experiments is an arbitrary set of strings, which can be of the form
>> >> "param=value" if that's useful. (Dataflow does this.) There is, again,
>> >> no namespacing on the param names, but we could use urns or impose
>> >> some other structure here.
>> >>
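As a small illustration of how a runner might read such a string list, the Python sketch below splits experiments into bare flags and "param=value" pairs. The experiment names used are made up for the example, and this is not a description of how Dataflow actually interprets its experiments.

```python
def split_experiments(experiments):
    """Split experiment strings into bare flags and param=value pairs."""
    flags = set()
    params = {}
    for item in experiments:
        name, sep, value = item.partition("=")
        if sep:
            params[name] = value  # "param=value" form
        else:
            flags.add(name)       # bare true/false style flag
    return flags, params


# Hypothetical experiment names, for illustration only.
flags, params = split_experiments(["use_fast_path", "parallelism=4"])
```

Every value arrives as a string, so typed options would still need a convention for how the runner converts them.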
>> >> > How to solve this?
>> >> >
>> >> > 1) Add all options of all Runners to each SDK
>> >> > We added some of the FlinkRunner options to the Python SDK but realized
>> >> > syncing is rather cumbersome in the long term. However, we want the most
>> >> > important options to be validated on the client side.
>> >>
>> >> I don't think this is sustainable in the long run. However, thinking
>> >> about this, in the worst case validation happens after construction
>> >> but before execution (as with much of our other validation) so it
>> >> isn't that bad.
>> >>
>> >> > 2) Pass "unknown" options via a separate list in the Proto which can
>> >> > only be accessed internally by the Runners. This still allows passing
>> >> > arbitrary options but we wouldn't leak unknown options and display them
>> >> > as top-level options.
>> >>
>> >> I think there needs to be a way for the user to communicate values
>> >> directly to the runner regardless of the SDK. My preference would be
>> >> to make this explicit, e.g. (repeated) --runner_option=..., rather
>> >> than scooping up all unknown flags at command line parsing time.
>> >> Perhaps an SDK that is aware of some runners could choose to lift
>> >> these as top-level options, but still pass them as runner options.
>> >>
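The explicit, repeated --runner_option preference can be sketched in Python with argparse as below. The --job_name flag is only a stand-in for an SDK-defined option, and parse_pipeline_args is a hypothetical helper, not the Beam PipelineOptions implementation.

```python
import argparse


def parse_pipeline_args(argv):
    """Keep SDK options and runner options apart; reject anything else."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--job_name")  # stand-in for an SDK-defined option
    parser.add_argument("--runner_option", action="append", default=[])
    known, unknown = parser.parse_known_args(argv)
    if unknown:
        # Unknown flags are not scooped up; the user must opt in explicitly.
        raise ValueError("unknown flags: %s" % unknown)
    sdk_options = {"job_name": known.job_name}
    return sdk_options, known.runner_option


sdk_options, runner_options = parse_pipeline_args(
    ["--job_name=demo", "--runner_option=parallelism=2"])
```

Here the runner options stay in their own list, which lines up with the idea of carrying them in a separate field rather than as top-level options.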
>> >> > On 13.10.18 02:34, Charles Chen wrote:
>> >> > > The current release branch
>> >> > > (https://github.com/apache/beam/commits/release-2.8.0) was cut after the
>> >> > > revert went in. Sent out https://github.com/apache/beam/pull/6683 as a
>> >> > > revert of the revert. Regarding your comment above, I can help out with
>> >> > > the design / PR reviews for common Python code as you suggest.
>> >> > >
>> >> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise <t...@apache.org> wrote:
>> >> > >
>> >> > > Thanks, will tag you and looking forward to feedback so we can
>> >> > > ensure that changes work for everyone.
>> >> > >
>> >> > > Looking at the PR, I see agreement from Max to revert the change on
>> >> > > the release branch, but not in master. Would you mind restoring it
>> >> > > in master?
>> >> > >
>> >> > > Thanks
>> >> > >
>> >> > > On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay <al...@google.com> wrote:
>> >> > >
>> >> > >
>> >> > >
>> >> > > On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen <c...@google.com> wrote:
>> >> > >
>> >> > > What I mean is that a user may find that it works for them
>> >> > > to pass "--myarg blah" and access it as "options.myarg"
>> >> > > without explicitly defining a "myarg" flag due to the added
>> >> > > logic. This is not the intended behavior and we may want to
>> >> > > change this implementation detail in the future. However,
>> >> > > having this logic in a released version makes it hard to
>> >> > > change this behavior since users may erroneously depend on
>> >> > > this undocumented behavior. Instead, we should namespace /
>> >> > > scope this so that it is obvious that this is meant for
>> >> > > runner (and not Beam user) consumption.
>> >> > >
>> >> > > On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise <t...@apache.org> wrote:
>> >> > >
>> >> > > Can you please elaborate on what practical problems
>> >> > > this introduces for users?
>> >> > >
>> >> > > I can see that this change allows a user to specify a
>> >> > > runner-specific option, which in the future may change
>> >> > > because we decide to scope differently. If this only
>> >> > > affects users of the portable Flink runner (like us),
>> >> > > then no need to revert, because at this early stage we
>> >> > > prefer something that works over being blocked.
>> >> > >
>> >> > > It would also be really great if some of the core Python
>> >> > > SDK developers could help out with the design aspects
>> >> > > and PR reviews of changes that affect common Python
>> >> > > code. Anyone who specifically wants to be tagged on
>> >> > > relevant JIRAs and PRs?
>> >> > >
>> >> > >
>> >> > > I would be happy to be tagged, and I can also help with
>> >> > > including other relevant folks whenever possible. In general I
>> >> > > think Robert, Charles, and myself are good candidates.
>> >> > >
>> >> > >
>> >> > > Thanks
>> >> > >
>> >> > >
>> >> > > On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay <al...@google.com> wrote:
>> >> > >
>> >> > >
>> >> > >
>> >> > > On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen <c...@google.com> wrote:
>> >> > >
>> >> > > For context, I made comments on
>> >> > > https://github.com/apache/beam/pull/6600 noting
>> >> > > that the changes being made were not good for
>> >> > > Beam backwards-compatibility. The change as is
>> >> > > allows users to use pipeline options without
>> >> > > explicitly defining them, which is not the type
>> >> > > of usage we would like to encourage since we
>> >> > > prefer to be explicit whenever possible. If
>> >> > > users write pipelines with this sort of pattern,
>> >> > > they will potentially encounter pain when
>> >> > > upgrading to a later version since this is an
>> >> > > implementation detail and not an officially
>> >> > > supported pattern. I agree with the comments
>> >> > > above that this is ultimately a scoping issue.
>> >> > > I would not have a problem with these changes if
>> >> > > they were explicitly scoped under either a
>> >> > > runner or unparsed options namespace.
>> >> > >
>> >> > > As a second note, since the 2.8.0 release is
>> >> > > being cut right now, because of these
>> >> > > backwards-compatibility concerns, I would
>> >> > > suggest reverting these changes, at least until
>> >> > > 2.8.0 is cut, so we can have a discussion here
>> >> > > before committing to and releasing any API-level
>> >> > > changes.
>> >> > >
>> >> > >
>> >> > > +1 I would like to revert the changes in order not to
>> >> > > rush this into the release. Once this discussion
>> >> > > results in an agreement, changes can be brought back.
>> >> > >
>> >> > >
>> >> > > On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde <hero...@google.com> wrote:
>> >> > >
>> >> > > Agree that pipeline options lack some
>> >> > > mechanism for scoping. It is also not always
>> >> > > possible to distinguish options meant to be
>> >> > > consumed at pipeline construction time, by
>> >> > > the runner, by the SDK harness, by the user
>> >> > > code or any combination -- and this causes
>> >> > > confusion every now and then.
>> >> > >
>> >> > > For Dataflow, we have been using
>> >> > > "experiments" for arbitrary runner-specific
>> >> > > options. It's simply a string list pipeline
>> >> > > option that all SDKs support and, for Go at
>> >> > > least, is sent to portable runners. Flink
>> >> > > can do the same in the short term to move
>> >> > > forward.
>> >> > >
>> >> > > Henning
>> >> > >
>> >> > >
>> >> > > On Fri, Oct 12, 2018 at 8:50 AM Thomas Weise <t...@apache.org> wrote:
>> >> > >
>> >> > > [moving to the list]
>> >> > >
>> >> > > The requirement driving this part of the
>> >> > > change was to allow a user to specify
>> >> > > pipeline options that a runner supports
>> >> > > without having to declare those in each
>> >> > > language SDK.
>> >> > >
>> >> > > In the specific scenario, we have
>> >> > > options that the Flink runner supports
>> >> > > (and can validate), that are not
>> >> > > enumerated in the Python SDK.
>> >> > >
>> >> > > I think we have a bigger problem scoping
>> >> > > pipeline options. For example, the
>> >> > > runner options are dumped into the SDK
>> >> > > worker. There is also a possibility of
>> >> > > name collisions. So I think this would
>> >> > > benefit from broader feedback.
>> >> > >
>> >> > > Thanks,
>> >> > > Thomas
>> >> > >
>> >> > >
>> >> > > ---------- Forwarded message ---------
>> >> > > From: *Charles Chen* <notificati...@github.com>
>> >> > > Date: Fri, Oct 12, 2018 at 8:36 AM
>> >> > > Subject: Re: [apache/beam] [BEAM-5442] Store duplicate unknown options in a list argument (#6600)
>> >> > > To: apache/beam <b...@noreply.github.com>
>> >> > > Cc: Thomas Weise <thomas.we...@gmail.com>, Mention <ment...@noreply.github.com>
>> >> > >
>> >> > >
>> >> > > CC: @tweise <https://github.com/tweise>
>> >> > >
>> >> > >
>> >> > >
>> >> > >