On Mon, May 27, 2019 at 1:59 PM Maximilian Michels <[email protected]> wrote:
>
> > I think it makes a lot of sense for job servers to also act as
> > expansion services, but one can't of course defer expansion until job
> > submission.
>
> One could defer the expansion until job submission, but it would be a
> semantic change to how expansion currently works. In particular with
> respect to providing feedback to the user during expansion, and with
> regard to the immutability of pipelines, this would not be a good choice.

It's also not always possible, e.g. the results (their types in
particular, but also even how many there are) may not be known until
after expansion.

> As for the Flink job server, it already hosts an expansion server. It
> would make sense to let them share the same GRPC server, which would
> avoid having to know the port of the expansion server.

+1

And, on a pragmatic note, it would be good to share the port with the
artifact server as well, in which case the job server could say "serve
artifacts to me" without having to worry about any intervening port
forwarding, etc. that sits between it and the SDK.
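To make that concrete: job submission, expansion, and artifact staging
are all just gRPC services, so they can be registered on one server. A
rough Python sketch, using the generated stubs from the Beam portability
protos; the bare base servicers below just stand in for the real
implementations (they return UNIMPLEMENTED):

    from concurrent import futures

    import grpc

    from apache_beam.portability.api import beam_artifact_api_pb2_grpc
    from apache_beam.portability.api import beam_expansion_api_pb2_grpc
    from apache_beam.portability.api import beam_job_api_pb2_grpc

    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))

    # All three services register on the same server, hence one port:
    # the SDK needs a single endpoint for job submission, expansion,
    # and artifact staging. A real job server would register its actual
    # servicer implementations instead of these placeholder bases.
    beam_job_api_pb2_grpc.add_JobServiceServicer_to_server(
        beam_job_api_pb2_grpc.JobServiceServicer(), server)
    beam_expansion_api_pb2_grpc.add_ExpansionServiceServicer_to_server(
        beam_expansion_api_pb2_grpc.ExpansionServiceServicer(), server)
    beam_artifact_api_pb2_grpc.add_ArtifactStagingServiceServicer_to_server(
        beam_artifact_api_pb2_grpc.ArtifactStagingServiceServicer(), server)

    server.add_insecure_port('[::]:8099')
    server.start()
    # ... block until shutdown however the host process prefers.

Then the job endpoint, expansion endpoint, and artifact endpoint all
collapse into a single address, and anything that can reach the job
server can reach the rest.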
> On 27.05.19 13:33, Robert Bradshaw wrote:
> > On Mon, May 27, 2019 at 12:38 PM Maximilian Michels <[email protected]> wrote:
> >>
> >>> Which environment would be used to perform the expansion? I think this
> >>> is an interesting option, as long as it does not introduce a hard
> >>> dependency on docker.
> >>
> >> The same environment that the to-be-expanded transform requires during
> >> runtime.
> >>
> >>> Dataflow has been doing something similar in this route where it is
> >>> trying to get rid of the driver program running on the user's machine.
> >>> If you can get the expansion service to launch and run an environment
> >>> to perform the expansion, you could also get it to create and submit a
> >>> job as well, returning data around the running job.
> >>
> >> Portability already runs without a driver on the user machine, apart
> >> from expansion and staging. For anything runtime-related the job server
> >> kicks in. It's worth thinking about delegating expansion and staging to
> >> the job server.
> >
> > I think it makes a lot of sense for job servers to also act as
> > expansion services, but one can't of course defer expansion until job
> > submission.
> >
> >> On 24.05.19 23:48, Lukasz Cwik wrote:
> >>> Dataflow has been doing something similar in this route where it is
> >>> trying to get rid of the driver program running on the user's machine.
> >>> If you can get the expansion service to launch and run an environment
> >>> to perform the expansion, you could also get it to create and submit a
> >>> job as well, returning data around the running job.
> >>>
> >>> On Thu, May 23, 2019 at 7:47 AM Thomas Weise <[email protected]> wrote:
> >>>
> >>>     On Thu, May 23, 2019 at 3:46 AM Maximilian Michels <[email protected]> wrote:
> >>>
> >>>         > Writing a new transform involves updating the expansion
> >>>         > service to include their new transform.
> >>>
> >>>         Would it be conceivable that the expansion is performed via
> >>>         the environment? That would solve the problem of updating the
> >>>         expansion service, although it adds additional complexity for
> >>>         bringing up the environment.
> >>>
> >>>     Which environment would be used to perform the expansion? I think
> >>>     this is an interesting option, as long as it does not introduce a
> >>>     hard dependency on docker.
> >>>
> >>>     On 23.05.19 11:31, Robert Bradshaw wrote:
> >>>     > On Wed, May 22, 2019 at 6:17 PM Maximilian Michels <[email protected]> wrote:
> >>>     >
> >>>     >     Hi,
> >>>     >
> >>>     >     Robert and I were discussing the subject of user-specified
> >>>     >     environments for external transforms [1]. We couldn't decide
> >>>     >     whether users should have direct control over the
> >>>     >     environment when they use an external transform in their
> >>>     >     pipeline.
> >>>     >
> >>>     >     In my mind, it is quite natural that the Expansion Service
> >>>     >     is a long-running service that gets started with a list of
> >>>     >     available environments.
> >>>     >
> >>>     > +1.
> >>>     >
> >>>     > IMHO, the expansion service should be expected to provide valid
> >>>     > environments for the transforms it vends. Removing this
> >>>     > expectation seems wrong. Making it cheap to specify non-default
> >>>     > dependencies without building (publishing, etc.) a docker image
> >>>     > is probably key to making this work well (and also to allowing
> >>>     > more powerful environment introspection).
> >>>     >
> >>>     >     Such a list can be outdated and users may write transforms
> >>>     >     for a new environment they want to use in their pipeline.
> >>>     >
> >>>     > This is the part that I'm having trouble following. Writing a
> >>>     > new transform involves updating the expansion service to include
> >>>     > their new transform. The author of a transform (in other words,
> >>>     > the one who defines its expansion and implementation) is in a
> >>>     > position to name its dependencies, etc., and the user of the
> >>>     > transform (the one invoking it) is generally not in a good
> >>>     > position to know what environments would be valid.
> >>>     >
> >>>     >     The easiest way would be to allow passing the environment
> >>>     >     with the transform.
> >>>     >
> >>>     > What this allows is using existing transforms in new
> >>>     > environments. There are possibly some use cases for this, e.g.
> >>>     > the expansion of a given transform may be compatible with either
> >>>     > version X or version Y of a library, left up to the discretion
> >>>     > of the caller, but I think that this is really just a deficiency
> >>>     > in our environment specifications (e.g. one should be able to
> >>>     > express this flexibility in the returned environment).
> >>>     >
> >>>     >     Note that we already give users control over the "main"
> >>>     >     environment via the PortablePipelineOptions, so this
> >>>     >     wouldn't be an entirely new concept.
> >>>     >
> >>>     > Yes, the author of a pipeline/transform chooses the environment
> >>>     > in which those transforms execute.
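(An aside, to ground the PortablePipelineOptions point above: the "main"
environment is already user-controllable through the existing portable
pipeline options. In the Python SDK that looks roughly like the
following; the docker image name is made up, and the flag names are as I
recall them, so treat the details as approximate.)

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # The default ("main") environment, in which the user's own DoFns
    # run, is chosen by the user today; environments for *external*
    # transforms are the open question in this thread.
    options = PipelineOptions([
        '--runner=PortableRunner',
        '--job_endpoint=localhost:8099',
        '--environment_type=DOCKER',
        # Made-up image name, for illustration only:
        '--environment_config=gcr.io/example/custom-python-sdk:latest',
    ])

    with beam.Pipeline(options=options) as p:
        _ = p | beam.Create(['a', 'b']) | beam.Map(str.upper)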
> >>>     >     The contrary position is that the Expansion Service should
> >>>     >     have full control over which environment is chosen. Going
> >>>     >     back to the discussion about artifact staging [2], this
> >>>     >     could make it possible to perform more optimizations, such
> >>>     >     as merging environments or detecting conflicts. However,
> >>>     >     this only works if this information has been provided
> >>>     >     upfront to the Expansion Service. It wouldn't be impossible
> >>>     >     to provide these hints alongside the environment, as
> >>>     >     suggested in the previous paragraph.
> >>>     >
> >>>     >     Any opinions? Should we allow users to optionally specify
> >>>     >     an environment for external transforms?
> >>>     >
> >>>     >     Thanks,
> >>>     >     Max
> >>>     >
> >>>     >     [1] https://github.com/apache/beam/pull/8639
> >>>     >     [2] https://lists.apache.org/thread.html/6fcee7047f53cf1c0636fb65367ef70842016d57effe2e5795c4137d@%3Cdev.beam.apache.org%3E
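To make that closing question concrete, what's being proposed would
presumably look something like the sketch below: a hypothetical,
optional environment argument on the Python SDK's ExternalTransform. No
such argument exists today, and the URN, expansion service address, and
docker image are all made up for illustration.

    import apache_beam as beam
    from apache_beam.transforms.external import ExternalTransform

    with beam.Pipeline() as p:
        # Hypothetical: a user-supplied environment attached to one
        # external transform, overriding whatever the expansion service
        # would have chosen. The 'environment' argument does not exist.
        _ = (p
             | 'ReadViaJava' >> ExternalTransform(
                 'beam:external:java:example:read:v1',  # illustrative URN
                 None,  # no configuration payload in this sketch
                 'localhost:8097',  # expansion service address
                 environment='docker:gcr.io/example/custom-java-env'))

The counter-position argued above is that this knob shouldn't exist at
all, and that the environment returned by the expansion service should
instead be expressive enough to cover cases like the
version-X-or-version-Y one.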
