On Wed, Aug 7, 2019 at 5:59 PM Thomas Weise <[email protected]> wrote:
>
>> > * The pipeline construction code itself may need access to cluster
>> > resources. In such cases the jar file cannot be created offline.
>>
>> Could you elaborate?
>
> The entry point is arbitrary code written by the user, not limited to Beam
> pipeline construction alone. For example, there could be access to a file
> system or other service to fetch metadata that is required to build the
> pipeline. Such services can be accessed when the code runs within the
> infrastructure, but typically not in a development environment.
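For concreteness, a minimal Python sketch of the kind of construction-time
dependency being described (the hdfs:// path and the read-per-partition
shape are invented for illustration; the point is that the pipeline graph
depends on a lookup that typically only resolves in-cluster):

    import apache_beam as beam
    from apache_beam.io.filesystems import FileSystems

    def build_pipeline(options):
        # Construction-time lookup against an in-cluster file system. The
        # hdfs:// path is hypothetical; it typically resolves only inside
        # the infrastructure, not on a developer's laptop.
        match = FileSystems.match(['hdfs://namenode/warehouse/events/part-*'])[0]
        paths = [m.path for m in match.metadata_list]

        p = beam.Pipeline(options=options)
        for i, path in enumerate(paths):
            # One read per partition: the shape of the pipeline graph
            # itself is unknown until the lookup above succeeds.
            _ = p | f'Read{i}' >> beam.io.ReadFromText(path)
        return p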
Yes, this may be limited to the case where the pipeline construction can
be done on the user's machine before submission (remotely staging and
executing the Python (or Go, or ...) code within the infrastructure to
build the pipeline, and then running the job server there, is a bit more
complicated). We control the entry point from then on.

>> > * For k8s deployment, a container image with the SDK and application code
>> > is required for the worker. The jar file (which is really a derived
>> > artifact) would need to be built in addition to the container image.
>>
>> Yes. For standard use, a vanilla, released Beam SDK container + staged
>> artifacts should be sufficient.
>>
>> > * To build such a jar file, the user would need a build environment with
>> > the job server and application code. Do we want to make that assumption?
>>
>> Actually, it's probably much easier than that. A jar file is just a
>> zip file with a standard structure, to which one can easily add (data)
>> files without having a full build environment. The (pre-compiled) main
>> class would know how to read this data to construct the pipeline and
>> kick off the job just like any other Flink job.
>
> Before assembling the jar, the job server runs to create the ingredients.
> That requires the (matching) Java environment on the Python developer's
> machine.

We can run the job server and have it create the jar (and if we keep the
job server running, we can use it to interact with the running job).
However, if the jar layout is simple enough, there's no need to even build
it from Java. Taken to the extreme, this is a one-shot, jar-based
JobService API: we choose a standard layout for where to put the pipeline
description and artifacts, and can "augment" an existing jar (one with a
runner-specific main class whose entry point knows how to read this data
and kick off the pipeline, just as a user's driver code would) into one
that has the portable pipeline packaged into it for submission to a
cluster.
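To sketch what that augmentation could look like in Python (the
BEAM-PIPELINE/ layout and the names below are placeholders, not a layout
anyone has agreed on):

    import os
    import shutil
    import zipfile

    def augment_jar(runner_jar, output_jar, pipeline_json, artifact_paths):
        # Start from the pre-built, runner-specific jar; no Java toolchain
        # is needed on the developer's machine.
        shutil.copyfile(runner_jar, output_jar)
        # A jar is just a zip file, so the pipeline description and staged
        # artifacts can be appended as ordinary zip entries.
        with zipfile.ZipFile(output_jar, 'a') as jar:
            jar.writestr('BEAM-PIPELINE/pipeline.json', pipeline_json)
            for path in artifact_paths:
                jar.write(path,
                          'BEAM-PIPELINE/artifacts/' + os.path.basename(path))

The runner-specific main class already baked into runner_jar would then
read BEAM-PIPELINE/pipeline.json and the artifacts from its own classpath
and submit the job, exactly as a user's hand-written driver would.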
