Yes, we don't know how to parse and/or validate it. On Tue, Oct 16, 2018 at 1:14 AM Lukasz Cwik <lc...@google.com> wrote:
> I see, is the issue that we currently are using a JSON representation for > options when being serialized and when we get some unknown option, we don't > know how to convert it into its JSON form? > > On Mon, Oct 15, 2018 at 2:41 PM Robert Bradshaw <rober...@google.com> > wrote: > >> On Mon, Oct 15, 2018 at 11:30 PM Lukasz Cwik <lc...@google.com> wrote: >> > >> > On Mon, Oct 15, 2018 at 1:17 PM Robert Bradshaw <rober...@google.com> >> wrote: >> >> >> >> On Mon, Oct 15, 2018 at 7:50 PM Lukasz Cwik <lc...@google.com> wrote: >> >> > >> >> > I agree with the sentiment for better error checking. >> >> > >> >> > We can try to make it such that the SDK can "fetch" the set of >> options that the runner supports by making a call to the Job API. The API >> could return a list of option names (descriptions for --help purposes and >> also potentially the expected format) which would remove the worry around >> "unknown" options. Yes I understand to be able to make the Job API call, we >> may need to parse some options from the args parameters first and then >> parse the unknown options after they are fetched. >> >> >> >> This is an interesting idea, but seems it could get quite complicated. >> >> E.g. for delegating runners, one would first read the options to >> >> determine which runner to fetch the options from, which would then >> >> return a set of options that possibly depends on the values of some of >> >> its options... >> >> >> >> > Alternatively, we can choose an explicit format upfront. >> >> > To expand on the exact format for --runner_option=..., here are some >> different ideas: >> >> > 1) Specified multiple times, each one is an explicit flag >> >> > --runner_option=--blah=bar --runner_option=--foo=baz1 >> --runner_option=--foo=baz2 >> >> >> >> I'm -1 on this format. We should move away from the idea that options >> >> == flags (as that doesn't compose well with other libraries that do >> >> their own flags parsing). The ability to parse a set of flags into >> >> options is just a convenience that an author may (or may not) choose >> >> to use (e.g. when running pipelines a long-lived process like a >> >> service or a notebook, the command line flags are almost certainly not >> >> the right interface). >> >> >> >> > 2) specified multiple times, we drop the explicit flag >> >> > --runner_option=blah=bar --runner_option=foo=baz1 >> --runner_option=foo=baz2 >> >> >> >> This or (4) is my preference. >> >> >> >> > 3) we use a string which the runner can choose to interpret however >> they want (JSON/XML shown below) >> >> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}' >> >> > >> --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>' >> >> >> >> This would make validation hard. Also, I think it makes sense for some >> >> runner options to be "shared" (parallelism") by convention, so letting >> >> it be a free-form string wouldn't allow different runners to inspect >> >> different bits. >> >> >> >> We should consider if we should use urns for namespacing, and >> >> assigning semantic meaning to strings, here. >> >> >> >> > 4) we use a string which must be a specific format such as JSON >> (allows the SDK to do simple validation): >> >> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}' >> >> >> >> I like this in that at least some validation can be performed, and >> >> expectations of how to format richer types. On the other hand it gets >> >> a bit verbose, given that most (I'd imagine) options will be simple. >> >> As with normal options, >> >> >> >> --option1=value1 --option2=value2 >> >> >> >> is shorthand for {"option1": value1, "option2": value2}. >> >> >> > I lean to 4 the most. With 2, you run into issues of what does >> --runner_option=foo=["a", "b"] --runner_option=foo=["c", "d"] mean? >> > Is it an error or list of lists or concatenated. Similar issues for map >> types represented via JSON object {...} >> >> We can err to be on the safe side unless/until an argument can be made >> that merging is more natural. I just think this will be excessively >> verbose to use. >> >> >> > I would strongly suggest that we go with the "fetch" approach, since >> this makes the set of options discoverable and helps users find errors much >> earlier in their pipeline. >> >> >> >> This seems like an advanced feature that SDKs may want to support, but >> >> I wouldn't want to require this complexity for bootstrapping an SDK. >> >> >> > SDKs that are starting off wouldn't need to "fetch" options, they could >> choose to not support runner options or they could choose to pass all >> options through to the runner blindly. Fetching the options only provides >> the SDK the ability to provide error checking upfront and useful error/help >> messages. >> >> But how to even pass all options through blindly is exactly the >> difficulty we're running into here. >> >> >> Regarding always keeping runner options separate, +1, though I'm not >> >> sure the line is always clear. >> >> >> >> >> >> > On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw <rober...@google.com> >> wrote: >> >> >> >> >> >> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels <m...@apache.org> >> wrote: >> >> >> > >> >> >> > I agree that the current approach breaks the pipeline options >> contract >> >> >> > because "unknown" options get parsed in the same way as options >> which >> >> >> > have been defined by the user. >> >> >> >> >> >> FWIW, I think we're already breaking this "contract." Unknown >> options >> >> >> are silently ignored; with this change we just change how we record >> >> >> them. It still feels a bit hacky though. >> >> >> >> >> >> > I'm not sure the `experiments` flag works for us. AFAIK it only >> allows >> >> >> > true/false flags. We want to pass all types of pipeline options >> to the >> >> >> > Runner. >> >> >> >> >> >> Experiments is an arbitrary set of strings, which can be of the form >> >> >> "param=value" if that's useful. (Dataflow does this.) There is, >> again, >> >> >> no namespacing on the param names, but we could user urns or impose >> >> >> some other structure here. >> >> >> >> >> >> > How to solve this? >> >> >> > >> >> >> > 1) Add all options of all Runners to each SDK >> >> >> > We added some of the FlinkRunner options to the Python SDK but >> realized >> >> >> > syncing is rather cumbersome in the long term. However, we want >> the most >> >> >> > important options to be validated on the client side. >> >> >> >> >> >> I don't think this is sustainable in the long run. However, thinking >> >> >> about this, in the worse case validation happens after construction >> >> >> but before execution (as with much of our other validation) so it >> >> >> isn't that bad. >> >> >> >> >> >> > 2) Pass "unknown" options via a separate list in the Proto which >> can >> >> >> > only be accessed internally by the Runners. This still allows >> passing >> >> >> > arbitrary options but we wouldn't leak unknown options and >> display them >> >> >> > as top-level options. >> >> >> >> >> >> I think there needs to be a way for the user to communicate values >> >> >> directly to the runner regardless of the SDK. My preference would be >> >> >> to make this explicit, e.g. (repeated) --runner_option=..., rather >> >> >> than scooping up all unknown flags at command line parsing time. >> >> >> Perhaps an SDK that is aware of some runners could choose to lift >> >> >> these as top-level options, but still pass them as runner options. >> >> >> >> >> >> > On 13.10.18 02:34, Charles Chen wrote: >> >> >> > > The current release branch >> >> >> > > (https://github.com/apache/beam/commits/release-2.8.0) was cut >> after the >> >> >> > > revert went in. Sent out >> https://github.com/apache/beam/pull/6683 as a >> >> >> > > revert of the revert. Regarding your comment above, I can help >> out with >> >> >> > > the design / PR reviews for common Python code as you suggest. >> >> >> > > >> >> >> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise <t...@apache.org >> >> >> > > <mailto:t...@apache.org>> wrote: >> >> >> > > >> >> >> > > Thanks, will tag you and looking forward to feedback so we >> can >> >> >> > > ensure that changes work for everyone. >> >> >> > > >> >> >> > > Looking at the PR, I see agreement from Max to revert the >> change on >> >> >> > > the release branch, but not in master. Would you mind to >> restore it >> >> >> > > in master? >> >> >> > > >> >> >> > > Thanks >> >> >> > > >> >> >> > > On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay < >> al...@google.com >> >> >> > > <mailto:al...@google.com>> wrote: >> >> >> > > >> >> >> > > >> >> >> > > >> >> >> > > On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen < >> c...@google.com >> >> >> > > <mailto:c...@google.com>> wrote: >> >> >> > > >> >> >> > > What I mean is that a user may find that it works >> for them >> >> >> > > to pass "--myarg blah" and access it as >> "options.myarg" >> >> >> > > without explicitly defining a "my_arg" flag due to >> the added >> >> >> > > logic. This is not the intended behavior and we >> may want to >> >> >> > > change this implementation detail in the future. >> However, >> >> >> > > having this logic in a released version makes it >> hard to >> >> >> > > change this behavior since users may erroneously >> depend on >> >> >> > > this undocumented behavior. Instead, we should >> namespace / >> >> >> > > scope this so that it is obvious that this is meant >> for >> >> >> > > runner (and not Beam user) consumption. >> >> >> > > >> >> >> > > On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise >> >> >> > > <t...@apache.org <mailto:t...@apache.org>> wrote: >> >> >> > > >> >> >> > > Can you please elaborate more what practical >> problems >> >> >> > > this introduces for users? >> >> >> > > >> >> >> > > I can see that this change allows a user to >> specify a >> >> >> > > runner specific option, which in the future may >> change >> >> >> > > because we decide to scope differently. If this >> only >> >> >> > > affects users of the portable Flink runner >> (like us), >> >> >> > > then no need to revert, because at this early >> stage we >> >> >> > > prefer something that works over being blocked. >> >> >> > > >> >> >> > > It would also be really great if some of the >> core Python >> >> >> > > SDK developers could help out with the design >> aspects >> >> >> > > and PR reviews of changes that affect common >> Python >> >> >> > > code. Anyone who specifically wants to be >> tagged on >> >> >> > > relevant JIRAs and PRs? >> >> >> > > >> >> >> > > >> >> >> > > I would be happy to be tagged, and I can also help with >> >> >> > > including other relevant folks whenever possible. In >> general I >> >> >> > > think Robert, Charles, myself are good candidates. >> >> >> > > >> >> >> > > >> >> >> > > Thanks >> >> >> > > >> >> >> > > >> >> >> > > On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay >> >> >> > > <al...@google.com <mailto:al...@google.com>> >> wrote: >> >> >> > > >> >> >> > > >> >> >> > > >> >> >> > > On Fri, Oct 12, 2018 at 10:11 AM, Charles >> Chen >> >> >> > > <c...@google.com <mailto:c...@google.com>> >> wrote: >> >> >> > > >> >> >> > > For context, I made comments on >> >> >> > > >> https://github.com/apache/beam/pull/6600 noting >> >> >> > > that the changes being made were not >> good for >> >> >> > > Beam backwards-compatibility. The >> change as is >> >> >> > > allows users to use pipeline options >> without >> >> >> > > explicitly defining them, which is not >> the type >> >> >> > > of usage we would like to encourage >> since we >> >> >> > > prefer to be explicit whenever >> possible. If >> >> >> > > users write pipelines with this sort of >> pattern, >> >> >> > > they will potentially encounter pain >> when >> >> >> > > upgrading to a later version since this >> is an >> >> >> > > implementation detail and not an >> officially >> >> >> > > supported pattern. I agree with the >> comments >> >> >> > > above that this is ultimately a scoping >> issue. >> >> >> > > I would not have a problem with these >> changes if >> >> >> > > they were explicitly scoped under >> either a >> >> >> > > runner or unparsed options namespace. >> >> >> > > >> >> >> > > As a second note, since the 2.8.0 >> release is >> >> >> > > being cut right now, because of these >> >> >> > > backwards-compatibility concerns, I >> would >> >> >> > > suggest reverting these changes, at >> least until >> >> >> > > 2.8.0 is cut, so we can have a >> discussion here >> >> >> > > before committing to and releasing any >> API-level >> >> >> > > changes. >> >> >> > > >> >> >> > > >> >> >> > > +1 I would like to revert the changes in >> order not >> >> >> > > rush this into the release. Once this >> discussion >> >> >> > > results in an agreement changes can be >> brought back. >> >> >> > > >> >> >> > > >> >> >> > > On Fri, Oct 12, 2018 at 9:26 AM Henning >> Rohde >> >> >> > > <hero...@google.com <mailto: >> hero...@google.com>> >> >> >> > > wrote: >> >> >> > > >> >> >> > > Agree that pipeline options lack >> some >> >> >> > > mechanism for scoping. It is also >> not always >> >> >> > > possible distinguish options meant >> to be >> >> >> > > consumed at pipeline construction >> time, by >> >> >> > > the runner, by the SDK harness, by >> the user >> >> >> > > code or any combination -- and this >> causes >> >> >> > > confusion every now and then. >> >> >> > > >> >> >> > > For Dataflow, we have been using >> >> >> > > "experiments" for arbitrary >> runner-specific >> >> >> > > options. It's simply a string list >> pipeline >> >> >> > > option that all SDKs support and, >> for Go at >> >> >> > > least, is sent to portable runners. >> Flink >> >> >> > > can do the same in the short term >> to move >> >> >> > > forward. >> >> >> > > >> >> >> > > Henning >> >> >> > > >> >> >> > > >> >> >> > > On Fri, Oct 12, 2018 at 8:50 AM >> Thomas Weise >> >> >> > > <t...@apache.org <mailto: >> t...@apache.org>> wrote: >> >> >> > > >> >> >> > > [moving to the list] >> >> >> > > >> >> >> > > The requirement driving this >> part of the >> >> >> > > change was to allow a user to >> specify >> >> >> > > pipeline options that a runner >> supports >> >> >> > > without having to declare those >> in each >> >> >> > > language SDK. >> >> >> > > >> >> >> > > In the specific scenario, we >> have >> >> >> > > options that the Flink runner >> supports >> >> >> > > (and can validate), that are not >> >> >> > > enumerated in the Python SDK. >> >> >> > > >> >> >> > > I think we have a bigger >> problem scoping >> >> >> > > pipeline options. For example, >> the >> >> >> > > runner options are dumped into >> the SDK >> >> >> > > worker. There is also a >> possibility of >> >> >> > > name collisions. So I think >> this would >> >> >> > > benefit from broader feedback. >> >> >> > > >> >> >> > > Thanks, >> >> >> > > Thomas >> >> >> > > >> >> >> > > >> >> >> > > ---------- Forwarded message >> --------- >> >> >> > > From: *Charles Chen* >> >> >> > > <notificati...@github.com >> >> >> > > <mailto: >> notificati...@github.com>> >> >> >> > > Date: Fri, Oct 12, 2018 at 8:36 >> AM >> >> >> > > Subject: Re: [apache/beam] >> [BEAM-5442] >> >> >> > > Store duplicate unknown options >> in a >> >> >> > > list argument (#6600) >> >> >> > > To: apache/beam < >> b...@noreply.github.com >> >> >> > > <mailto:b...@noreply.github.com >> >> >> >> >> > > Cc: Thomas Weise < >> thomas.we...@gmail.com >> >> >> > > <mailto:thomas.we...@gmail.com >> >>, >> >> >> > > Mention < >> ment...@noreply.github.com >> >> >> > > <mailto: >> ment...@noreply.github.com>> >> >> >> > > >> >> >> > > >> >> >> > > CC: @tweise < >> https://github.com/tweise> >> >> >> > > >> >> >> > > — >> >> >> > > You are receiving this because >> you were >> >> >> > > mentioned. >> >> >> > > Reply to this email directly, >> view it on >> >> >> > > GitHub >> >> >> > > < >> https://github.com/apache/beam/pull/6600#issuecomment-429367754>, >> >> >> > > or mute the thread >> >> >> > > < >> https://github.com/notifications/unsubscribe-auth/AAQGDwwt15R85eq9pySUisyxq2HYz-Vyks5ukLcLgaJpZM4XMo-T >> >. >> >> >> > > >> >> >> > > >> >> >> > > >> >