Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Maximilian Michels Fri, 26 Oct 2018 09:20:02 -0700

I would prefer we don't introduce a (quirky) way of passing unknown options 
that forces users to type JSON into the command line (or similar acrobatics)

Same here, the JSON approach seems technically nice but too bulky for users.

To someone wanting to run a pipeline, all options are equally important, 
whether they are application specific, SDK specific or runner specific.

I'm also reluctant to force users to use `--runner_option=` because thedivision into "Runner" options and other options seems rather arbitraryto users. Most built-in options are also Runner-related.

It should be possible to *optionally* qualify/scope (to cover cases where there is ambiguity), but otherwise I prefer the format we currently have.

Yes, namespacing is a problem. What happens if the user defines a customPipelineOption which clashes with one of the builtin ones? If both areset, which one is actually applied?



Here is a summary of the possible paths going forward:


1) Validate PipelineOptions at Runner side
==========================================

The main issue raised here was that we want to move away from parsingarguments which look like options without validating them. An easy fixwould be to actually validate them on the Runner side. This could bedone by changing the deserialization code of PipelineOptions which sofar ignores unknown JSON options.


See: PipelineOptionsTranslation.fromProto(Struct protoOptions)

Actually, this wouldn't work for user-defined PipelineOptions becausethey might not be known to the Runner (if they are defined in Python).



2) Introduce a Runner-Option Flag
=================================

In this approach we would try to add as many pipeline options for aRunner to the SDK, but allow additional Runner options to be passedusing the `--runner-option=key=val` flag. The Runner, like in 1), wouldhave to ensure validation. I think this has been the most favored way sofar. Going forward, that means that `--parallelism=4` and`--runner-option=parallelism=4` will have the same effect for the FlinkRunner.



3) Implement Fetching of Options from JobServer
===============================================

The options are retrieved from the JobServer before submitting thepipeline. I think this would be ideal but, as mentioned before, itincreases the complexity for implementing new SDKs and might overalljust not be worth the effort.

What do you think? I'd implement 2) for the next release, unless thereare advocates for a different approach.


Cheers,
Max

On 25.10.18 21:19, Thomas Weise wrote:

Reminder that this is something we ideally address before the nextrelease...

Considering the discussion so far, my preference is that we get awayfrom unknown options and discover valid options from the runner (byexpanding the job service).

Once the SDK is aware of all valid options, it is possible to providemeaningful feedback to the user (validate or help), and correctly handlescopes and types.

I would prefer we don't introduce a (quirky) way of passing unknownoptions that forces users to type JSON into the command line (or similaracrobatics). To someone wanting to run a pipeline, all options areequally important, whether they are application specific, SDK specificor runner specific. It should be possible to *optionally* qualify/scope(to cover cases where there is ambiguity), but otherwise I prefer theformat we currently have.

Regarding type inference: Correct handling of numeric types matters, seefollowing issue with protobuf (not JSON):https://issues.apache.org/jira/browse/BEAM-5509


Thomas

On Thu, Oct 18, 2018 at 6:55 AM Robert Bradshaw <[email protected]<mailto:[email protected]>> wrote:


    On Wed, Oct 17, 2018 at 11:35 PM Lukasz Cwik <[email protected]
    <mailto:[email protected]>> wrote:


        On Tue, Oct 16, 2018 at 11:51 AM Robert Bradshaw
        <[email protected] <mailto:[email protected]>> wrote:

            On Tue, Oct 16, 2018 at 7:03 PM Lukasz Cwik
            <[email protected] <mailto:[email protected]>> wrote:
             >
             > For all unknown options, the SDK can require that all
            flag values be specified explicitly as a valid JSON type.
             > starts with { -> object
             > starts with [ -> list
             > starts with " -> string
             > is null / true / false -> null / true / false
             > otherwise is number.
             >
             > This isn't great for strings but works well for all the
            other types.
             >
             > Thus for known options, the additional typing information
            would disambiguate whether something should be a
            string/boolean/number/object/list but for unknown options we
            would expect the user to use valid JSON explicitly and write:
             > --foo={"object": "value"}
             > --foo=["value", "value2"]
             > --foo="string value"

            Due to shell escaping, one would have to write

            --foo=\"string value\"

            or actually, due to the space

            --foo='"string value"'

            or some other variation on that, which is really
            unfortunate. (The JSON list/objects would need similar
            quoting, but that's less surprising.) Also, does this mean
            we'd only have one kind of number (not integer vs. float,

i.e. --parallelism=5.0 works)? I suppose that is JSON.


        Yes, I was suspecting that users would need to type the second
        variant as \"...\" I found more burdensome then '"..."'


             > --foo=3.5 --foo=-4
             > --foo=true --foo=false
             > --foo=null
             > This also works if the flag is repeated, so --foo=3.5
            --foo=-4 is [3.5, -4]

            The thing that sparked this discussion was what to do when
            unknown foo is repeated, but only one value given.


        If the person only specifies one value, then they have to
        disambiguate and put it in a list, only if they specify more
        then one value will they have to turn it into a list.

        I believe we could come up with other schemes on how to convert
        unknown options to JSON where we prefer strings over non-string
        types like null/true/false/numbers/list/object and require the
        user to escape out of the string default but anything that is
        too different from strict JSON would cause headaches when
        attempting to explain the format to users. I think a happy
        middle ground would be that we will only require escaping for
        strings which are ambiguous, so things like true, null, false,
        ... to be treated as strings would require the user to escape them.


    I'd prefer to avoid inferring the type of an unknown argument based
    on its contents, which can lead to surprises. We could declare every
    unknown type to be repeated string, and let any parsing/validation
    occur on the runner. If desired, we could pass these around as a
    single "runner options" dict that runners could inspect and use to
    populate the actual dict rather than mixing parsed and unparsed
    options.



             > On Tue, Oct 16, 2018 at 7:56 AM Thomas Weise
            <[email protected] <mailto:[email protected]>> wrote:
             >>
             >> Discovering options from the job server seems preferable
            over replicating runner options in SDKs.
             >>
             >> Runners evolve on their own, and with portability the
            SDK does not need to know anything about the runner.
             >>
             >> Regarding --runner-option. It is true that this looks
            less user friendly. On the other hand it eliminates the
            possibility of name collisions.
             >>
             >> But if options are discovered, the SDK can perform full
            validation. It would only be necessary to use explicit
            scoping when there is ambiguity.
             >>
             >> Thomas
             >>
             >>
             >> On Tue, Oct 16, 2018 at 3:48 AM Maximilian Michels
            <[email protected] <mailto:[email protected]>> wrote:
             >>>
             >>> Fetching options directly from the Runner's JobServer
            seems like the
             >>> ideal solution. I agree with Robert that it creates
            additional
             >>> complexity for SDK authors, so the `--runner-option`
            flag would be an
             >>> easy and explicit way to specify additional Runner options.
             >>>
             >>> The format I prefer would be: --runner_option=key1=val1
             >>> --runner_option=key2=val2
             >>>
             >>> Now, from the perspective of end users, I think it is
            neither convenient
             >>> nor reasonable to require the use of the
            `--runner-option` flag. To the
             >>> user it seems nebulous why some pipeline options live
            in the top-level
             >>> option namespace while others need to be nested within
            an option. This
             >>> is amplified by there being two Runners the user needs
            to be aware of,
             >>> i.e. PortableRunner and the actual Runner
            (Dataflow/Flink/Spark..).
             >>>
             >>> I feel like we would eventually replicate all options
            in the SDK because
             >>> otherwise users have to use the `--runner-option`, but
            at least we can
             >>> specify options which have not been replicated yet.
             >>>
             >>> -Max
             >>>
             >>> On 16.10.18 10:27, Robert Bradshaw wrote:
             >>> > Yes, we don't know how to parse and/or validate it.
             >>> >
             >>> > On Tue, Oct 16, 2018 at 1:14 AM Lukasz Cwik
            <[email protected] <mailto:[email protected]>
             >>> > <mailto:[email protected] <mailto:[email protected]>>>
            wrote:
             >>> >
             >>> >     I see, is the issue that we currently are using a
            JSON
             >>> >     representation for options when being serialized
            and when we get
             >>> >     some unknown option, we don't know how to convert
            it into its JSON form?
             >>> >
             >>> >     On Mon, Oct 15, 2018 at 2:41 PM Robert Bradshaw
            <[email protected] <mailto:[email protected]>
             >>> >     <mailto:[email protected]
            <mailto:[email protected]>>> wrote:
             >>> >
             >>> >         On Mon, Oct 15, 2018 at 11:30 PM Lukasz Cwik
            <[email protected] <mailto:[email protected]>
             >>> >         <mailto:[email protected]
            <mailto:[email protected]>>> wrote:
             >>> >          >
             >>> >          > On Mon, Oct 15, 2018 at 1:17 PM Robert
            Bradshaw
             >>> >         <[email protected]
            <mailto:[email protected]> <mailto:[email protected]
            <mailto:[email protected]>>> wrote:
             >>> >          >>
             >>> >          >> On Mon, Oct 15, 2018 at 7:50 PM Lukasz Cwik
             >>> >         <[email protected] <mailto:[email protected]>
            <mailto:[email protected] <mailto:[email protected]>>> wrote:
             >>> >          >> >
             >>> >          >> > I agree with the sentiment for better
            error checking.
             >>> >          >> >
             >>> >          >> > We can try to make it such that the SDK
            can "fetch" the
             >>> >         set of options that the runner supports by
            making a call to the
             >>> >         Job API. The API could return a list of
            option names
             >>> >         (descriptions for --help purposes and also
            potentially the
             >>> >         expected format) which would remove the worry
            around "unknown"
             >>> >         options. Yes I understand to be able to make
            the Job API call,
             >>> >         we may need to parse some options from the
            args parameters first
             >>> >         and then parse the unknown options after they
            are fetched.
             >>> >          >>
             >>> >          >> This is an interesting idea, but seems it
            could get quite
             >>> >         complicated.
             >>> >          >> E.g. for delegating runners, one would
            first read the options to
             >>> >          >> determine which runner to fetch the
            options from, which
             >>> >         would then
             >>> >          >> return a set of options that possibly
            depends on the values
             >>> >         of some of
             >>> >          >> its options...
             >>> >          >>
             >>> >          >> > Alternatively, we can choose an
            explicit format upfront.
             >>> >          >> > To expand on the exact format for
            --runner_option=...,
             >>> >         here are some different ideas:
             >>> >          >> > 1) Specified multiple times, each one
            is an explicit flag
             >>> >          >> > --runner_option=--blah=bar
            --runner_option=--foo=baz1
             >>> >         --runner_option=--foo=baz2
             >>> >          >>
             >>> >          >> I'm -1 on this format. We should move
            away from the idea
             >>> >         that options
             >>> >          >> == flags (as that doesn't compose well
            with other libraries
             >>> >         that do
             >>> >          >> their own flags parsing). The ability to
            parse a set of
             >>> >         flags into
             >>> >          >> options is just a convenience that an
            author may (or may
             >>> >         not) choose
             >>> >          >> to use (e.g. when running pipelines a
            long-lived process like a
             >>> >          >> service or a notebook, the command line
            flags are almost
             >>> >         certainly not
             >>> >          >> the right interface).
             >>> >          >>
             >>> >          >> > 2) specified multiple times, we drop
            the explicit flag
             >>> >          >> > --runner_option=blah=bar
            --runner_option=foo=baz1
             >>> >         --runner_option=foo=baz2
             >>> >          >>
             >>> >          >> This or (4) is my preference.
             >>> >          >>
             >>> >          >> > 3) we use a string which the runner can
            choose to
             >>> >         interpret however they want (JSON/XML shown
            below)
             >>> >          >> > --runner_option='{"blah": "bar", "foo":
            ["baz1", "baz2"]}'
             >>> >          >> >

>>> >--runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'

             >>> >          >>
             >>> >          >> This would make validation hard. Also, I
            think it makes
             >>> >         sense for some
             >>> >          >> runner options to be "shared"
            (parallelism") by convention,
             >>> >         so letting
             >>> >          >> it be a free-form string wouldn't allow
            different runners to
             >>> >         inspect
             >>> >          >> different bits.
             >>> >          >>
             >>> >          >> We should consider if we should use urns
            for namespacing, and
             >>> >          >> assigning semantic meaning to strings, here.
             >>> >          >>
             >>> >          >> > 4) we use a string which must be a
            specific format such as
             >>> >         JSON (allows the SDK to do simple validation):
             >>> >          >> > --runner_option='{"blah": "bar", "foo":
            ["baz1", "baz2"]}'
             >>> >          >>
             >>> >          >> I like this in that at least some
            validation can be
             >>> >         performed, and
             >>> >          >> expectations of how to format richer
            types. On the other
             >>> >         hand it gets
             >>> >          >> a bit verbose, given that most (I'd
            imagine) options will be
             >>> >         simple.
             >>> >          >> As with normal options,
             >>> >          >>
             >>> >          >>     --option1=value1 --option2=value2
             >>> >          >>
             >>> >          >> is shorthand for {"option1": value1,
            "option2": value2}.
             >>> >          >>
             >>> >          > I lean to 4 the most. With 2, you run into
            issues of what
             >>> >         does --runner_option=foo=["a", "b"]
            --runner_option=foo=["c",
             >>> >         "d"] mean?
             >>> >          > Is it an error or list of lists or
            concatenated. Similar
             >>> >         issues for map types represented via JSON
            object {...}
             >>> >
             >>> >         We can err to be on the safe side
            unless/until an argument can
             >>> >         be made
             >>> >         that merging is more natural. I just think
            this will be excessively
             >>> >         verbose to use.
             >>> >
             >>> >          >> > I would strongly suggest that we go
            with the "fetch"
             >>> >         approach, since this makes the set of options
            discoverable and
             >>> >         helps users find errors much earlier in their
            pipeline.
             >>> >          >>
             >>> >          >> This seems like an advanced feature that
            SDKs may want to
             >>> >         support, but
             >>> >          >> I wouldn't want to require this
            complexity for bootstrapping
             >>> >         an SDK.
             >>> >          >>
             >>> >          > SDKs that are starting off wouldn't need
            to "fetch" options,
             >>> >         they could choose to not support runner
            options or they could
             >>> >         choose to pass all options through to the
            runner blindly.
             >>> >         Fetching the options only provides the SDK
            the ability to
             >>> >         provide error checking upfront and useful
            error/help messages.
             >>> >
             >>> >         But how to even pass all options through
            blindly is exactly the
             >>> >         difficulty we're running into here.
             >>> >
             >>> >          >> Regarding always keeping runner options
            separate, +1, though
             >>> >         I'm not
             >>> >          >> sure the line is always clear.
             >>> >          >>
             >>> >          >>
             >>> >          >> > On Mon, Oct 15, 2018 at 8:04 AM Robert
            Bradshaw
             >>> >         <[email protected]
            <mailto:[email protected]> <mailto:[email protected]
            <mailto:[email protected]>>> wrote:
             >>> >          >> >>
             >>> >          >> >> On Mon, Oct 15, 2018 at 3:58 PM
            Maximilian Michels
             >>> >         <[email protected] <mailto:[email protected]>
            <mailto:[email protected] <mailto:[email protected]>>> wrote:
             >>> >          >> >> >
             >>> >          >> >> > I agree that the current approach
            breaks the pipeline
             >>> >         options contract
             >>> >          >> >> > because "unknown" options get parsed
            in the same way as
             >>> >         options which
             >>> >          >> >> > have been defined by the user.
             >>> >          >> >>
             >>> >          >> >> FWIW, I think we're already breaking
            this "contract."
             >>> >         Unknown options
             >>> >          >> >> are silently ignored; with this change
            we just change how
             >>> >         we record
             >>> >          >> >> them. It still feels a bit hacky though.
             >>> >          >> >>
             >>> >          >> >> > I'm not sure the `experiments` flag
            works for us. AFAIK
             >>> >         it only allows
             >>> >          >> >> > true/false flags. We want to pass
            all types of pipeline
             >>> >         options to the
             >>> >          >> >> > Runner.
             >>> >          >> >>
             >>> >          >> >> Experiments is an arbitrary set of
            strings, which can be
             >>> >         of the form
             >>> >          >> >> "param=value" if that's useful.
            (Dataflow does this.)
             >>> >         There is, again,
             >>> >          >> >> no namespacing on the param names, but
            we could user urns
             >>> >         or impose
             >>> >          >> >> some other structure here.
             >>> >          >> >>
             >>> >          >> >> > How to solve this?
             >>> >          >> >> >
             >>> >          >> >> > 1) Add all options of all Runners to
            each SDK
             >>> >          >> >> > We added some of the FlinkRunner
            options to the Python
             >>> >         SDK but realized
             >>> >          >> >> > syncing is rather cumbersome in the
            long term. However,
             >>> >         we want the most
             >>> >          >> >> > important options to be validated on
            the client side.
             >>> >          >> >>
             >>> >          >> >> I don't think this is sustainable in
            the long run.
             >>> >         However, thinking
             >>> >          >> >> about this, in the worse case
            validation happens after
             >>> >         construction
             >>> >          >> >> but before execution (as with much of
            our other
             >>> >         validation) so it
             >>> >          >> >> isn't that bad.
             >>> >          >> >>
             >>> >          >> >> > 2) Pass "unknown" options via a
            separate list in the
             >>> >         Proto which can
             >>> >          >> >> > only be accessed internally by the
            Runners. This still
             >>> >         allows passing
             >>> >          >> >> > arbitrary options but we wouldn't
            leak unknown options
             >>> >         and display them
             >>> >          >> >> > as top-level options.
             >>> >          >> >>
             >>> >          >> >> I think there needs to be a way for
            the user to
             >>> >         communicate values
             >>> >          >> >> directly to the runner regardless of
            the SDK. My
             >>> >         preference would be
             >>> >          >> >> to make this explicit, e.g. (repeated)
             >>> >         --runner_option=..., rather
             >>> >          >> >> than scooping up all unknown flags at
            command line
             >>> >         parsing time.
             >>> >          >> >> Perhaps an SDK that is aware of some
            runners could choose
             >>> >         to lift
             >>> >          >> >> these as top-level options, but still
            pass them as runner
             >>> >         options.
             >>> >          >> >>
             >>> >          >> >> > On 13.10.18 02:34, Charles Chen wrote:
             >>> >          >> >> > > The current release branch
             >>> >          >> >> > >

>>> >(https://github.com/apache/beam/commits/release-2.8.0) was cut

             >>> >         after the
             >>> >          >> >> > > revert went in.  Sent out
             >>> > https://github.com/apache/beam/pull/6683 as a
             >>> >          >> >> > > revert of the revert.  Regarding
            your comment above,
             >>> >         I can help out with
             >>> >          >> >> > > the design / PR reviews for common
            Python code as you
             >>> >         suggest.
             >>> >          >> >> > >
             >>> >          >> >> > > On Fri, Oct 12, 2018 at 4:48 PM
            Thomas Weise
             >>> >         <[email protected] <mailto:[email protected]>
            <mailto:[email protected] <mailto:[email protected]>>
             >>> >          >> >> > > <mailto:[email protected]
            <mailto:[email protected]> <mailto:[email protected]
            <mailto:[email protected]>>>> wrote:
             >>> >          >> >> > >
             >>> >          >> >> > >     Thanks, will tag you and
            looking forward to
             >>> >         feedback so we can
             >>> >          >> >> > >     ensure that changes work for
            everyone.
             >>> >          >> >> > >
             >>> >          >> >> > >     Looking at the PR, I see
            agreement from Max to
             >>> >         revert the change on
             >>> >          >> >> > >     the release branch, but not in
            master. Would you
             >>> >         mind to restore it
             >>> >          >> >> > >     in master?
             >>> >          >> >> > >
             >>> >          >> >> > >     Thanks
             >>> >          >> >> > >
             >>> >          >> >> > >     On Fri, Oct 12, 2018 at 4:40
            PM Ahmet Altay
             >>> >         <[email protected] <mailto:[email protected]>
            <mailto:[email protected] <mailto:[email protected]>>
             >>> >          >> >> > >     <mailto:[email protected]
            <mailto:[email protected]>
             >>> >         <mailto:[email protected]
            <mailto:[email protected]>>>> wrote:
             >>> >          >> >> > >
             >>> >          >> >> > >
             >>> >          >> >> > >
             >>> >          >> >> > >         On Fri, Oct 12, 2018 at
            11:31 AM, Charles
             >>> >         Chen <[email protected] <mailto:[email protected]>
            <mailto:[email protected] <mailto:[email protected]>>
             >>> >          >> >> > >         <mailto:[email protected]
            <mailto:[email protected]>
             >>> >         <mailto:[email protected]
            <mailto:[email protected]>>>> wrote:
             >>> >          >> >> > >
             >>> >          >> >> > >             What I mean is that a
            user may find that
             >>> >         it works for them
             >>> >          >> >> > >             to pass "--myarg blah"
            and access it as
             >>> >         "options.myarg"
             >>> >          >> >> > >             without explicitly
            defining a "my_arg"
             >>> >         flag due to the added
             >>> >          >> >> > >             logic.  This is not
            the intended behavior
             >>> >         and we may want to
             >>> >          >> >> > >             change this
            implementation detail in the
             >>> >         future.  However,
             >>> >          >> >> > >             having this logic in a
            released version
             >>> >         makes it hard to
             >>> >          >> >> > >             change this behavior
            since users may
             >>> >         erroneously depend on
             >>> >          >> >> > >             this undocumented
            behavior.  Instead, we
             >>> >         should namespace /
             >>> >          >> >> > >             scope this so that it
            is obvious that
             >>> >         this is meant for
             >>> >          >> >> > >             runner (and not Beam
            user) consumption.
             >>> >          >> >> > >
             >>> >          >> >> > >             On Fri, Oct 12, 2018
            at 10:48 AM Thomas Weise
             >>> >          >> >> > >             <[email protected]
            <mailto:[email protected]> <mailto:[email protected]
            <mailto:[email protected]>>
             >>> >         <mailto:[email protected]
            <mailto:[email protected]> <mailto:[email protected]
            <mailto:[email protected]>>>> wrote:
             >>> >          >> >> > >
             >>> >          >> >> > >                 Can you please
            elaborate more what
             >>> >         practical problems
             >>> >          >> >> > >                 this introduces
            for users?
             >>> >          >> >> > >
             >>> >          >> >> > >                 I can see that
            this change allows a
             >>> >         user to specify a
             >>> >          >> >> > >                 runner specific
            option, which in the
             >>> >         future may change
             >>> >          >> >> > >                 because we decide
            to scope
             >>> >         differently. If this only
             >>> >          >> >> > >                 affects users of
            the portable Flink
             >>> >         runner (like us),
             >>> >          >> >> > >                 then no need to
            revert, because at
             >>> >         this early stage we
             >>> >          >> >> > >                 prefer something
            that works over
             >>> >         being blocked.
             >>> >          >> >> > >
             >>> >          >> >> > >                 It would also be
            really great if some
             >>> >         of the core Python
             >>> >          >> >> > >                 SDK developers
            could help out with
             >>> >         the design aspects
             >>> >          >> >> > >                 and PR reviews of
            changes that affect
             >>> >         common Python
             >>> >          >> >> > >                 code. Anyone who
            specifically wants
             >>> >         to be tagged on
             >>> >          >> >> > >                 relevant JIRAs and
            PRs?
             >>> >          >> >> > >
             >>> >          >> >> > >
             >>> >          >> >> > >         I would be happy to be
            tagged, and I can also
             >>> >         help with
             >>> >          >> >> > >         including other relevant
            folks whenever
             >>> >         possible. In general I
             >>> >          >> >> > >         think Robert, Charles,
            myself are good
             >>> >         candidates.
             >>> >          >> >> > >
             >>> >          >> >> > >
             >>> >          >> >> > >                 Thanks
             >>> >          >> >> > >
             >>> >          >> >> > >
             >>> >          >> >> > >                 On Fri, Oct 12,
            2018 at 10:20 AM
             >>> >         Ahmet Altay
             >>> >          >> >> > >                 <[email protected]
            <mailto:[email protected]>
             >>> >         <mailto:[email protected]
            <mailto:[email protected]>> <mailto:[email protected]
            <mailto:[email protected]>
             >>> >         <mailto:[email protected]
            <mailto:[email protected]>>>> wrote:
             >>> >          >> >> > >
             >>> >          >> >> > >
             >>> >          >> >> > >
             >>> >          >> >> > >                     On Fri, Oct
            12, 2018 at 10:11 AM,
             >>> >         Charles Chen

>>> > >> >> > ><[email protected] <mailto:[email protected]>

             >>> >         <mailto:[email protected]
            <mailto:[email protected]>> <mailto:[email protected]
            <mailto:[email protected]>
             >>> >         <mailto:[email protected]
            <mailto:[email protected]>>>> wrote:
             >>> >          >> >> > >
             >>> >          >> >> > >                         For
            context, I made comments on
             >>> >          >> >> > >
            https://github.com/apache/beam/pull/6600 noting
             >>> >          >> >> > >                         that the
            changes being made
             >>> >         were not good for
             >>> >          >> >> > >                         Beam
             >>> >         backwards-compatibility.  The change as is
             >>> >          >> >> > >                         allows
            users to use pipeline
             >>> >         options without
             >>> >          >> >> > >                         explicitly
            defining them,
             >>> >         which is not the type
             >>> >          >> >> > >                         of usage
            we would like to
             >>> >         encourage since we
             >>> >          >> >> > >                         prefer to
            be explicit
             >>> >         whenever possible.  If
             >>> >          >> >> > >                         users
            write pipelines with
             >>> >         this sort of pattern,
             >>> >          >> >> > >                         they will
            potentially
             >>> >         encounter pain when
             >>> >          >> >> > >                         upgrading
            to a later version
             >>> >         since this is an

>>> > >> >> > >implementation detail and not

             >>> >         an officially
             >>> >          >> >> > >                         supported
            pattern.  I agree
             >>> >         with the comments
             >>> >          >> >> > >                         above that
            this is ultimately
             >>> >         a scoping issue.
             >>> >          >> >> > >                         I would
            not have a problem
             >>> >         with these changes if
             >>> >          >> >> > >                         they were
            explicitly scoped
             >>> >         under either a
             >>> >          >> >> > >                         runner or
            unparsed options
             >>> >         namespace.
             >>> >          >> >> > >
             >>> >          >> >> > >                         As a
            second note, since the
             >>> >         2.8.0 release is
             >>> >          >> >> > >                         being cut
            right now, because
             >>> >         of these

>>> > >> >> > >backwards-compatibility

             >>> >         concerns, I would
             >>> >          >> >> > >                         suggest
            reverting these
             >>> >         changes, at least until
             >>> >          >> >> > >                         2.8.0 is
            cut, so we can have
             >>> >         a discussion here
             >>> >          >> >> > >                         before
            committing to and
             >>> >         releasing any API-level
             >>> >          >> >> > >                         changes.
             >>> >          >> >> > >
             >>> >          >> >> > >
             >>> >          >> >> > >                     +1 I would
            like to revert the
             >>> >         changes in order not
             >>> >          >> >> > >                     rush this into
            the release. Once
             >>> >         this discussion
             >>> >          >> >> > >                     results in an
            agreement changes
             >>> >         can be brought back.
             >>> >          >> >> > >
             >>> >          >> >> > >
             >>> >          >> >> > >                         On Fri,
            Oct 12, 2018 at 9:26
             >>> >         AM Henning Rohde

>>> > >> >> > ><[email protected] <mailto:[email protected]>

             >>> >         <mailto:[email protected]
            <mailto:[email protected]>> <mailto:[email protected]
            <mailto:[email protected]>
             >>> >         <mailto:[email protected]
            <mailto:[email protected]>>>>
             >>> >          >> >> > >                         wrote:
             >>> >          >> >> > >
             >>> >          >> >> > >                             Agree
            that pipeline
             >>> >         options lack some

>>> > >> >> > >mechanism for scoping. It

             >>> >         is also not always

>>> > >> >> > >possible distinguish

             >>> >         options meant to be

>>> > >> >> > >consumed at pipeline

             >>> >         construction time, by
             >>> >          >> >> > >                             the
            runner, by the SDK
             >>> >         harness, by the user
             >>> >          >> >> > >                             code
            or any combination
             >>> >         -- and this causes

>>> > >> >> > >confusion every now and then.

             >>> >          >> >> > >
             >>> >          >> >> > >                             For
            Dataflow, we have
             >>> >         been using

>>> > >> >> > >"experiments" for

             >>> >         arbitrary runner-specific

>>> > >> >> > >options. It's simply a

             >>> >         string list pipeline
             >>> >          >> >> > >                             option
            that all SDKs
             >>> >         support and, for Go at
             >>> >          >> >> > >                             least,
            is sent to
             >>> >         portable runners. Flink
             >>> >          >> >> > >                             can do
            the same in the
             >>> >         short term to move
             >>> >          >> >> > >                             forward.
             >>> >          >> >> > >
             >>> >          >> >> > >                             Henning
             >>> >          >> >> > >
             >>> >          >> >> > >
             >>> >          >> >> > >                             On
            Fri, Oct 12, 2018 at
             >>> >         8:50 AM Thomas Weise

>>> > >> >> > ><[email protected] <mailto:[email protected]>

             >>> >         <mailto:[email protected]
            <mailto:[email protected]>> <mailto:[email protected]
            <mailto:[email protected]>
             >>> >         <mailto:[email protected]
            <mailto:[email protected]>>>> wrote:
             >>> >          >> >> > >

>>> > >> >> > >[moving to the list]

             >>> >          >> >> > >

>>> > >> >> > >The requirement

             >>> >         driving this part of the

>>> > >> >> > >change was to allow a

             >>> >         user to specify

>>> > >> >> > >pipeline options that

             >>> >         a runner supports

>>> > >> >> > >without having to

             >>> >         declare those in each

>>> > >> >> > >language SDK.

             >>> >          >> >> > >
             >>> >          >> >> > >                                 In
            the specific
             >>> >         scenario, we have

>>> > >> >> > >options that the

             >>> >         Flink runner supports

>>> > >> >> > >(and can validate),

             >>> >         that are not

>>> > >> >> > >enumerated in the

             >>> >         Python SDK.
             >>> >          >> >> > >
             >>> >          >> >> > >                                 I
            think we have a
             >>> >         bigger problem scoping

>>> > >> >> > >pipeline options. For

             >>> >         example, the

>>> > >> >> > >runner options are

             >>> >         dumped into the SDK

>>> > >> >> > >worker. There is also

             >>> >         a possibility of

>>> > >> >> > >name collisions. So I

             >>> >         think this would

>>> > >> >> > >benefit from broader

             >>> >         feedback.
             >>> >          >> >> > >

>>> > >> >> > >Thanks,

             >>> >          >> >> > >                                 Thomas
             >>> >          >> >> > >
             >>> >          >> >> > >

>>> > >> >> > >---------- Forwarded

             >>> >         message ---------

>>> > >> >> > >From: *Charles Chen*

             >>> >          >> >> > >
             >>> >           <[email protected]
            <mailto:[email protected]>
            <mailto:[email protected]
            <mailto:[email protected]>>
             >>> >          >> >> > >
             >>> >           <mailto:[email protected]
            <mailto:[email protected]>
             >>> >         <mailto:[email protected]
            <mailto:[email protected]>>>>

>>> > >> >> > >Date: Fri, Oct 12,

             >>> >         2018 at 8:36 AM

>>> > >> >> > >Subject: Re:

             >>> >         [apache/beam] [BEAM-5442]

>>> > >> >> > >Store duplicate

             >>> >         unknown options in a

>>> > >> >> > >list argument (#6600)>>> > >> >> > >To: apache/beam

             >>> >         <[email protected]
            <mailto:[email protected]>
            <mailto:[email protected]
            <mailto:[email protected]>>
             >>> >          >> >> > >
             >>> >           <mailto:[email protected]
            <mailto:[email protected]>
            <mailto:[email protected]
            <mailto:[email protected]>>>>

>>> > >> >> > >Cc: Thomas Weise

             >>> >         <[email protected]
            <mailto:[email protected]>
            <mailto:[email protected] <mailto:[email protected]>>
             >>> >          >> >> > >
             >>> >           <mailto:[email protected]
            <mailto:[email protected]>
            <mailto:[email protected]
            <mailto:[email protected]>>>>,

>>> > >> >> > >Mention

             >>> >         <[email protected]
            <mailto:[email protected]>
            <mailto:[email protected]
            <mailto:[email protected]>>
             >>> >          >> >> > >
             >>> >           <mailto:[email protected]
            <mailto:[email protected]>
             >>> >         <mailto:[email protected]
            <mailto:[email protected]>>>>
             >>> >          >> >> > >
             >>> >          >> >> > >

>>> > >> >> > >CC: @tweise

             >>> >         <https://github.com/tweise>
             >>> >          >> >> > >
             >>> >          >> >> > >                                 —

>>> > >> >> > >You are receiving

             >>> >         this because you were

>>> > >> >> > >mentioned.>>> > >> >> > >Reply to this email

             >>> >         directly, view it on
             >>> >          >> >> > >                                 GitHub
             >>> >          >> >> > >

>>> ><https://github.com/apache/beam/pull/6600#issuecomment-429367754>,

             >>> >          >> >> > >                                 or
            mute the thread
             >>> >          >> >> > >

>>> ><https://github.com/notifications/unsubscribe-auth/AAQGDwwt15R85eq9pySUisyxq2HYz-Vyks5ukLcLgaJpZM4XMo-T>.

             >>> >          >> >> > >
             >>> >          >> >> > >
             >>> >          >> >> > >
             >>> >

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Reply via email to