On Mon, Nov 12, 2018 at 9:38 AM Maximilian Michels <m...@apache.org> wrote:
> Thank you Robert and Lukasz for your points. > > > Note that I believe that we will want to have multiple URLs to support > cross language pipelines since we will want to be able to ask other SDK > languages/versions for their "list" of supported PipelineOptions. > > Why is that? The Runner itself is the source of truth for its options. > Because other languages (or even different versions of the same language) may have their own options. For example, the Go SDK talks to a Java service which applies a SQL transform and returns the expanded form (this may require knowledge of some options like credentials for the filesystem, ...) and then talks to a Python service that performs another transform expansion. Finally the pipeline containing Go, Java and Python transforms is submitted to a runner and it performs its own internal replacements/expansions related to executing the pipeline. > Everything else is SDK-related and should be validated there. > > I imagined the process to go like this: > > a) Parse options to find JobServer URL > a) Retrieve options from JobServer > c) Parse all options > ...continue as always... > > An option is just represented by a name and a type. There is nothing > more to it, at least as of now. So it should be possible to parse them > in the SDK without much further work. > > Nevertheless, I agree with your sentiment, Robert. The "runner_option" > flag would prevent additional complexity. I still don't prefer it > because it's not nice from an end user perspective. If we were to > implement it, I would definitely go for the "option promotion" which you > mentioned. > > I hadn't thought about delegating runners, although the PortableRunner > is basically a delegating Runner. If that was an important feature, I > suppose the "runner_option" would be the preferred way. > > All in all, since there doesn't seem to be an excitement to implement > JobServer option retrieval and we will need the help of all SDK > developers, "runner_option" seems to be the more likely path. > I would say its a lack of time for people to improve this story over others but it can be revisited at some point in the future and I agree that using --runner_option as an interim provides value. > > -Max > > On 08.11.18 21:50, Lukasz Cwik wrote: > > The purpose of the spec would be to provide the names, type and > > descriptions of the options. We don't need anything beyond the JSON > > types (string, number, bool, object, list) because the only ambiguity we > > get is how do we parse command line string into the JSON type (and that > > ambiguity is actually only between string and non-string since all the > > other JSON types are unambiguous). > > > > Also, I believe the flow would be > > 1) Parse options > > a) Find the URL from args specified and/or additional methods on > > PipelineOptions that exposes a programmatic way to set the URL during > > parsing. > > b) Query URL for option specs > > c) Parse the remainder of the options > > 2) Construct pipeline > > 3) Choose runner > > 4) Submit job to runner > > > > Note that I believe that we will want to have multiple URLs to support > > cross language pipelines since we will want to be able to ask other SDK > > languages/versions for their "list" of supported PipelineOptions. > > > > On Thu, Nov 8, 2018 at 11:51 AM Robert Bradshaw <rober...@google.com > > <mailto:rober...@google.com>> wrote: > > > > There's two questions here: > > > > (A) What do we do in the short term? > > > > I think adding every runner option to every SDK is not sustainable > > (n*m work, assuming every SDK knows about every runner), and having a > > patchwork of options that were added as one-offs to SDKs is not > > desirable either. Furthermore, it seems difficult to parse unknown > > options as if they were valid options, so my preference here would be > > to just use a special runner_option flag. (One could also pass a set > > of unparsed/unvalidated runner options to the runner, even if they're > > not distinguished for the user, and runners (or any intermediates) > > could run a "promote" operation that promotes any of these unknowns > > that they recognize to real options before further processing. The > > parsing would be done as repeated-string, and not be intermingled > with > > the actually validated options. This is essential a variant of > > option 1.) > > > > (B) What do do in the long term? While the JobServer approach sounds > > nice, I think it introduces a lot of complexity (we have too much of > > that already) and still doesn't completely solve the problem. In > > particular, it changes the flow from > > > > 1. Parse options > > 2. Construct pipeline > > 3. Choose runner > > 4. Submit job to runner > > > > to > > > > 1. Parse options > > 2. Construct pipeline > > 3. Choose runner > > 4a. Query runner for option specs > > 4b. Re-parse options > > 4c. Submit job to runner > > > > In particular, doing 4b in the SDK rather than just let the runner > > itself do the validation as part of (4) doesn't save much and forces > > us to come up with a (probably incomplete) spec as to how to define > > options, their types, and their validations. It also means that a > > delegating runner must choose and interact with its downstream > > runner(s) synchronously, else we haven't actually solved the issue. > > > > For these reasons, I don't think we even want to go with the > JobServer > > approach in the long term, which has bearing on (A). > > > > - Robert > > > > > > On Wed, Nov 7, 2018 at 8:50 PM Maximilian Michels <m...@apache.org > > <mailto:m...@apache.org>> wrote: > > > > > > +1 > > > > > > If the preferred approach is to eventually have the JobServer > > serve the > > > options, then the best intermediate solution is to replicate > common > > > options in the SDKs. > > > > > > If we went down the "--runner_option" path, we would end up with > > > multiple ways of specifying the same options. We would eventually > > have > > > to deprecate "runner options" once we have the JobServer > > approach. I'd > > > like to avoid that. > > > > > > For the upcoming release we can revert the changes again and add > the > > > most common missing options to the SDKs. Then hopefully we should > > have > > > fetching implemented for the release after. > > > > > > Do you think that is feasible? > > > > > > Thanks, > > > Max > > > > > > On 30.10.18 23:00, Lukasz Cwik wrote: > > > > I still like #3 the most, just can't devote the time to get it > > done. > > > > > > > > Instead of going with a fully implemented #3, we could hardcode > > the a > > > > subset of options and types within each SDK until the job > server is > > > > ready to provide this information and then migrate to the > > "full" list. > > > > This would be an easy path for SDKs to take on. They could > > "know" of a > > > > few well known options, and if they want to support all > > options, they > > > > implement the integration with the job server. > > > > > > > > On Fri, Oct 26, 2018 at 9:19 AM Maximilian Michels > > <m...@apache.org <mailto:m...@apache.org> > > > > <mailto:m...@apache.org <mailto:m...@apache.org>>> wrote: > > > > > > > > > I would prefer we don't introduce a (quirky) way of > passing > > > > unknown options that forces users to type JSON into the > > command line > > > > (or similar acrobatics) > > > > Same here, the JSON approach seems technically nice but too > > bulky > > > > for users. > > > > > > > > > To someone wanting to run a pipeline, all options are > > equally > > > > important, whether they are application specific, SDK > > specific or > > > > runner specific. > > > > > > > > I'm also reluctant to force users to use `--runner_option=` > > because the > > > > division into "Runner" options and other options seems > > rather arbitrary > > > > to users. Most built-in options are also Runner-related. > > > > > > > > > It should be possible to *optionally* qualify/scope (to > > cover > > > > cases where there is ambiguity), but otherwise I prefer the > > format > > > > we currently have. > > > > > > > > Yes, namespacing is a problem. What happens if the user > > defines a > > > > custom > > > > PipelineOption which clashes with one of the builtin ones? > > If both are > > > > > > > > set, which one is actually applied? > > > > > > > > > > > > Note that PipelineOptions so far has been treating name > > equality to mean > > > > option equality and the Java implementation has a bunch of > > strict checks > > > > to make sure that default values aren't used for duplicate > > definitions, > > > > they have the same type, etc... > > > > With 1), you fail the job if the runner can't understand your > > option > > > > because its not represented the same way. User then needs to > fix-up > > > > their declaration of the option name. > > > > With 2), there are no name conflicts, the SDK will need to > > validate that > > > > the option isn't set in both formats and error out if it is > before > > > > pipeline submission time. > > > > With 3), you can prefetch all the options and error out to the > user > > > > during argument parsing time. > > > > > > > > > > > > > > > > Here is a summary of the possible paths going forward: > > > > > > > > > > > > 1) Validate PipelineOptions at Runner side > > > > ========================================== > > > > > > > > The main issue raised here was that we want to move away > > from parsing > > > > arguments which look like options without validating them. > > An easy fix > > > > would be to actually validate them on the Runner side. This > > could be > > > > done by changing the deserialization code of > > PipelineOptions which so > > > > far ignores unknown JSON options. > > > > > > > > See: PipelineOptionsTranslation.fromProto(Struct > protoOptions) > > > > > > > > Actually, this wouldn't work for user-defined > > PipelineOptions because > > > > they might not be known to the Runner (if they are defined > > in Python). > > > > > > > > > > > > 2) Introduce a Runner-Option Flag > > > > ================================= > > > > > > > > In this approach we would try to add as many pipeline > > options for a > > > > Runner to the SDK, but allow additional Runner options to > > be passed > > > > using the `--runner-option=key=val` flag. The Runner, like > > in 1), would > > > > have to ensure validation. I think this has been the most > > favored > > > > way so > > > > far. Going forward, that means that `--parallelism=4` and > > > > `--runner-option=parallelism=4` will have the same effect > > for the Flink > > > > Runner. > > > > > > > > > > > > 3) Implement Fetching of Options from JobServer > > > > =============================================== > > > > > > > > The options are retrieved from the JobServer before > > submitting the > > > > pipeline. I think this would be ideal but, as mentioned > > before, it > > > > increases the complexity for implementing new SDKs and > > might overall > > > > just not be worth the effort. > > > > > > > > > > > > What do you think? I'd implement 2) for the next release, > > unless there > > > > are advocates for a different approach. > > > > > > > > Cheers, > > > > Max > > >