Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Lukasz Cwik Mon, 12 Nov 2018 14:31:01 -0800

On Mon, Nov 12, 2018 at 9:38 AM Maximilian Michels <[email protected]> wrote:


> Thank you Robert and Lukasz for your points.
>
> > Note that I believe that we will want to have multiple URLs to support
> cross language pipelines since we will want to be able to ask other SDK
> languages/versions for their "list" of supported PipelineOptions.
>
> Why is that? The Runner itself is the source of truth for its options.
>

Because other languages (or even different versions of the same language)
may have their own options. For example, the Go SDK talks to a Java service
which applies a SQL transform and returns the expanded form (this may
require knowledge of some options like credentials for the filesystem, ...)
and then talks to a Python service that performs another transform
expansion. Finally the pipeline containing Go, Java and Python transforms
is submitted to a runner and it performs its own internal
replacements/expansions related to executing the pipeline.


> Everything else is SDK-related and should be validated there.
>
> I imagined the process to go like this:
>
>    a) Parse options to find JobServer URL
>    a) Retrieve options from JobServer
>    c) Parse all options
>    ...continue as always...
>
> An option is just represented by a name and a type. There is nothing
> more to it, at least as of now. So it should be possible to parse them
> in the SDK without much further work.
>
> Nevertheless, I agree with your sentiment, Robert. The "runner_option"
> flag would prevent additional complexity. I still don't prefer it
> because it's not nice from an end user perspective. If we were to
> implement it, I would definitely go for the "option promotion" which you
> mentioned.
>
> I hadn't thought about delegating runners, although the PortableRunner
> is basically a delegating Runner. If that was an important feature, I
> suppose the "runner_option" would be the preferred way.
>
> All in all, since there doesn't seem to be an excitement to implement
> JobServer option retrieval and we will need the help of all SDK
> developers, "runner_option" seems to be the more likely path.
>

I would say its a lack of time for people to improve this story over others
but it can be revisited at some point in the future and I agree that using
--runner_option as an interim provides value.


>
> -Max
>
> On 08.11.18 21:50, Lukasz Cwik wrote:
> > The purpose of the spec would be to provide the names, type and
> > descriptions of the options. We don't need anything beyond the JSON
> > types (string, number, bool, object, list) because the only ambiguity we
> > get is how do we parse command line string into the JSON type (and that
> > ambiguity is actually only between string and non-string since all the
> > other JSON types are unambiguous).
> >
> > Also, I believe the flow would be
> > 1) Parse options
> >    a) Find the URL from args specified and/or additional methods on
> > PipelineOptions that exposes a programmatic way to set the URL during
> > parsing.
> >    b) Query URL for option specs
> >    c) Parse the remainder of the options
> > 2) Construct pipeline
> > 3) Choose runner
> > 4) Submit job to runner
> >
> > Note that I believe that we will want to have multiple URLs to support
> > cross language pipelines since we will want to be able to ask other SDK
> > languages/versions for their "list" of supported PipelineOptions.
> >
> > On Thu, Nov 8, 2018 at 11:51 AM Robert Bradshaw <[email protected]
> > <mailto:[email protected]>> wrote:
> >
> >     There's two questions here:
> >
> >     (A) What do we do in the short term?
> >
> >     I think adding every runner option to every SDK is not sustainable
> >     (n*m work, assuming every SDK knows about every runner), and having a
> >     patchwork of options that were added as one-offs to SDKs is not
> >     desirable either. Furthermore, it seems difficult to parse unknown
> >     options as if they were valid options, so my preference here would be
> >     to just use a special runner_option flag. (One could also pass a set
> >     of unparsed/unvalidated runner options to the runner, even if they're
> >     not distinguished for the user, and runners (or any intermediates)
> >     could run a "promote" operation that promotes any of these unknowns
> >     that they recognize to real options before further processing. The
> >     parsing would be done as repeated-string, and not be intermingled
> with
> >     the actually validated options. This is essential a variant of
> >     option 1.)
> >
> >     (B) What do do in the long term? While the JobServer approach sounds
> >     nice, I think it introduces a lot of complexity (we have too much of
> >     that already) and still doesn't completely solve the problem. In
> >     particular, it changes the flow from
> >
> >     1. Parse options
> >     2. Construct pipeline
> >     3. Choose runner
> >     4. Submit job to runner
> >
> >     to
> >
> >     1. Parse options
> >     2. Construct pipeline
> >     3. Choose runner
> >     4a. Query runner for option specs
> >     4b. Re-parse options
> >     4c. Submit job to runner
> >
> >     In particular, doing 4b in the SDK rather than just let the runner
> >     itself do the validation as part of (4) doesn't save much and forces
> >     us to come up with a (probably incomplete) spec as to how to define
> >     options, their types, and their validations. It also means that a
> >     delegating runner must choose and interact with its downstream
> >     runner(s) synchronously, else we haven't actually solved the issue.
> >
> >     For these reasons, I don't think we even want to go with the
> JobServer
> >     approach in the long term, which has bearing on (A).
> >
> >     - Robert
> >
> >
> >     On Wed, Nov 7, 2018 at 8:50 PM Maximilian Michels <[email protected]
> >     <mailto:[email protected]>> wrote:
> >      >
> >      > +1
> >      >
> >      > If the preferred approach is to eventually have the JobServer
> >     serve the
> >      > options, then the best intermediate solution is to replicate
> common
> >      > options in the SDKs.
> >      >
> >      > If we went down the "--runner_option" path, we would end up with
> >      > multiple ways of specifying the same options. We would eventually
> >     have
> >      > to deprecate "runner options" once we have the JobServer
> >     approach. I'd
> >      > like to avoid that.
> >      >
> >      > For the upcoming release we can revert the changes again and add
> the
> >      > most common missing options to the SDKs. Then hopefully we should
> >     have
> >      > fetching implemented for the release after.
> >      >
> >      > Do you think that is feasible?
> >      >
> >      > Thanks,
> >      > Max
> >      >
> >      > On 30.10.18 23:00, Lukasz Cwik wrote:
> >      > > I still like #3 the most, just can't devote the time to get it
> >     done.
> >      > >
> >      > > Instead of going with a fully implemented #3, we could hardcode
> >     the a
> >      > > subset of options and types within each SDK until the job
> server is
> >      > > ready to provide this information and then migrate to the
> >     "full" list.
> >      > > This would be an easy path for SDKs to take on. They could
> >     "know" of a
> >      > > few well known options, and if they want to support all
> >     options, they
> >      > > implement the integration with the job server.
> >      > >
> >      > > On Fri, Oct 26, 2018 at 9:19 AM Maximilian Michels
> >     <[email protected] <mailto:[email protected]>
> >      > > <mailto:[email protected] <mailto:[email protected]>>> wrote:
> >      > >
> >      > >      > I would prefer we don't introduce a (quirky) way of
> passing
> >      > >     unknown options that forces users to type JSON into the
> >     command line
> >      > >     (or similar acrobatics)
> >      > >     Same here, the JSON approach seems technically nice but too
> >     bulky
> >      > >     for users.
> >      > >
> >      > >      > To someone wanting to run a pipeline, all options are
> >     equally
> >      > >     important, whether they are application specific, SDK
> >     specific or
> >      > >     runner specific.
> >      > >
> >      > >     I'm also reluctant to force users to use `--runner_option=`
> >     because the
> >      > >     division into "Runner" options and other options seems
> >     rather arbitrary
> >      > >     to users. Most built-in options are also Runner-related.
> >      > >
> >      > >      > It should be possible to *optionally* qualify/scope (to
> >     cover
> >      > >     cases where there is ambiguity), but otherwise I prefer the
> >     format
> >      > >     we currently have.
> >      > >
> >      > >     Yes, namespacing is a problem. What happens if the user
> >     defines a
> >      > >     custom
> >      > >     PipelineOption which clashes with one of the builtin ones?
> >     If both are
> >      > >
> >      > >     set, which one is actually applied?
> >      > >
> >      > >
> >      > > Note that PipelineOptions so far has been treating name
> >     equality to mean
> >      > > option equality and the Java implementation has a bunch of
> >     strict checks
> >      > > to make sure that default values aren't used for duplicate
> >     definitions,
> >      > > they have the same type, etc...
> >      > > With 1), you fail the job if the runner can't understand your
> >     option
> >      > > because its not represented the same way. User then needs to
> fix-up
> >      > > their declaration of the option name.
> >      > > With 2), there are no name conflicts, the SDK will need to
> >     validate that
> >      > > the option isn't set in both formats and error out if it is
> before
> >      > > pipeline submission time.
> >      > > With 3), you can prefetch all the options and error out to the
> user
> >      > > during argument parsing time.
> >      > >
> >      > >
> >      > >
> >      > >     Here is a summary of the possible paths going forward:
> >      > >
> >      > >
> >      > >     1) Validate PipelineOptions at Runner side
> >      > >     ==========================================
> >      > >
> >      > >     The main issue raised here was that we want to move away
> >     from parsing
> >      > >     arguments which look like options without validating them.
> >     An easy fix
> >      > >     would be to actually validate them on the Runner side. This
> >     could be
> >      > >     done by changing the deserialization code of
> >     PipelineOptions which so
> >      > >     far ignores unknown JSON options.
> >      > >
> >      > >     See: PipelineOptionsTranslation.fromProto(Struct
> protoOptions)
> >      > >
> >      > >     Actually, this wouldn't work for user-defined
> >     PipelineOptions because
> >      > >     they might not be known to the Runner (if they are defined
> >     in Python).
> >      > >
> >      > >
> >      > >     2) Introduce a Runner-Option Flag
> >      > >     =================================
> >      > >
> >      > >     In this approach we would try to add as many pipeline
> >     options for a
> >      > >     Runner to the SDK, but allow additional Runner options to
> >     be passed
> >      > >     using the `--runner-option=key=val` flag. The Runner, like
> >     in 1), would
> >      > >     have to ensure validation. I think this has been the most
> >     favored
> >      > >     way so
> >      > >     far. Going forward, that means that `--parallelism=4` and
> >      > >     `--runner-option=parallelism=4` will have the same effect
> >     for the Flink
> >      > >     Runner.
> >      > >
> >      > >
> >      > >     3) Implement Fetching of Options from JobServer
> >      > >     ===============================================
> >      > >
> >      > >     The options are retrieved from the JobServer before
> >     submitting the
> >      > >     pipeline. I think this would be ideal but, as mentioned
> >     before, it
> >      > >     increases the complexity for implementing new SDKs and
> >     might overall
> >      > >     just not be worth the effort.
> >      > >
> >      > >
> >      > >     What do you think? I'd implement 2) for the next release,
> >     unless there
> >      > >     are advocates for a different approach.
> >      > >
> >      > >     Cheers,
> >      > >     Max
> >
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Reply via email to