> BTW what are the next steps here? Heejong or Max, will one of you be
> able to come up with a detailed proposal around this?

Thank you for all the additional comments and ideas. I will try to capture them in a document and share it here. Of course we can continue the discussion in the meantime.

-Max

On 30.04.19 19:02, Chamikara Jayalath wrote:

On Fri, Apr 26, 2019 at 4:14 PM Lukasz Cwik <lc...@google.com> wrote:

    We should stick with URN + payload + artifact metadata[1], where the
    only mandatory type that all SDKs and expansion services understand
    is the "bytes" artifact type. This allows us to add optional URNs
    for file://, http://, Maven, PyPi, ... in the future. I would make
    the artifact staging service use the same URN + payload mechanism so
    that artifacts are compatible across the different services, and
    also make the artifact staging service queryable for the list of
    artifact types it supports. Finally, we would need to have
    environments enumerate the artifact types that they support.
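
    To make the shape concrete, here is a minimal Python sketch of such
    a representation (the names and URNs are illustrative only, not the
    actual Beam protos):

        from dataclasses import dataclass

        @dataclass
        class Artifact:
            # URN naming the artifact type; the mandatory one would be
            # something like a hypothetical "beam:artifact:bytes", with
            # optional URNs for file://, http://, Maven, PyPi, ...
            type_urn: str
            # Type-specific payload: the raw content for "bytes", a
            # serialized coordinate for a Maven artifact, etc.
            payload: bytes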

    Having everyone share the same "artifact" representation would be
    beneficial since:
    a) Python environments could install dependencies from a
    requirements.txt file (something that the Google Cloud Dataflow
    Python docker container allows for today)
    b) it provides an extensible and versioned mechanism for SDKs,
    environments, and artifact staging/retrieval services to support
    additional artifact types
    c) it allows expressing a canonical representation of an artifact,
    like a Maven package, so a runner could merge environments that it
    deems compatible.

    The flow I could see is (a rough code sketch follows the list):
    1) (optional) query artifact staging service for supported artifact
    types
    2) SDK requests the expansion service to expand the transform,
    passing in a list of artifact types the SDK and artifact staging
    service support; the expansion service returns a list of artifacts
    limited to those supported types + any supported by the environment
    3) SDK converts any artifact types that the artifact staging service
    or environment doesn't understand, e.g. pulls down Maven
    dependencies and converts them to "bytes" artifacts
    4) SDK sends artifacts to artifact staging service
    5) Artifact staging service converts any artifacts to types that the
    environment understands
    6) Environment is started and gets artifacts from the artifact
    retrieval service.
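
    A condensed sketch of that flow from the SDK's side (all service
    interfaces and helper names here are hypothetical, for illustration
    only):

        def expand_with_artifacts(transform, staging, expansion):
            # 1) (optional) query supported artifact types
            supported = staging.supported_artifact_types()

            # 2) expand, advertising what the SDK and the artifact
            #    staging service can handle
            response = expansion.expand(transform, supported)

            # 3) convert anything not understood into "bytes"
            #    artifacts, e.g. by resolving Maven coordinates
            artifacts = [a if a.type_urn in supported
                         else to_bytes_artifact(a)
                         for a in response.artifacts]

            # 4) stage the artifacts; steps 5 and 6 (conversion for
            #    the environment, startup and retrieval) happen on
            #    the runner side
            staging.stage(artifacts)
            return response.transform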


This is a very interesting proposal. I would add:
(5.5) artifact staging service resolves conflicts/duplicates for artifacts needed by different transforms of the same pipeline
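
One way the staging service could implement (5.5) is to key staged
artifacts by a content hash, so that identical dependencies pulled in
by different transforms are stored only once. A minimal sketch (not the
actual service API):

    import hashlib

    def stage_once(staged: dict, artifact) -> str:
        """Stages an artifact, deduplicating by content hash."""
        token = hashlib.sha256(artifact.payload).hexdigest()
        staged.setdefault(token, artifact)
        return token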

BTW what are the next steps here? Heejong or Max, will one of you be able to come up with a detailed proposal around this?

In the meantime I suggest we add temporary pipeline options for staging Java dependencies from Python (and vice versa) to unblock development and testing of the rest of the cross-language transforms stack. For example, https://github.com/apache/beam/pull/8340
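
For illustration, such a temporary option might be passed like this
from Python (the option name below is hypothetical; see the PR for the
actual flag):

    from apache_beam.options.pipeline_options import PipelineOptions

    # Hypothetical flag listing extra Java jars to stage alongside the
    # Python pipeline's own dependencies.
    options = PipelineOptions(['--extra_java_artifacts=/path/to/dep.jar'])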

Thanks,
Cham


    On Wed, Apr 24, 2019 at 4:44 AM Robert Bradshaw
    <rober...@google.com> wrote:

        On Wed, Apr 24, 2019 at 12:21 PM Maximilian Michels
        <m...@apache.org> wrote:
         >
         > Good idea to let the client expose an artifact staging
         > service that the ExpansionService could use to stage
         > artifacts. This solves two problems:
         >
         > (1) The Expansion Service not being able to access the Job
         > Server artifact staging service
         > (2) The client not having access to the dependencies
         > returned by the Expansion Service
         >
         > The downside is that it adds an additional indirection. The
         > alternative, letting the client handle staging the artifacts
         > returned by the Expansion Service, is more transparent and
         > easier to implement.

        The other downside is that it may not always be possible for the
        expansion service to connect to the artifact staging service (e.g.
        when constructing a pipeline locally against a remote expansion
        service).


    Just to make sure: you're saying the expansion service would return
    all the artifacts (bytes, urls, ...) as part of the response, since
    the expansion service wouldn't be able to connect to the SDK that is
    running locally either.


         > Ideally, the Expansion Service won't return any dependencies
         > because the environment already contains the required
         > dependencies. We could make it a requirement for the
         > expansion to be performed inside an environment. Then we
         > would already ensure during expansion time that the runtime
         > dependencies are available.

        Yes, it's cleanest if the expansion service provides an
        environment with all the dependencies provided. Interesting
        idea to make this a property of the expansion service itself.


    I had thought this too, but an opaque docker container that was
    built on top of a base Beam docker container would be very difficult
    for a runner to introspect and check for compatibility to allow
    fusion across PTransforms. I think artifacts need to be communicated
    in their canonical representation.
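
    For example, a Maven dependency communicated canonically, reusing
    the Artifact sketch above (the URN and payload format are
    hypothetical):

        maven_dep = Artifact(
            type_urn="beam:artifact:maven",  # hypothetical URN
            payload=b"org.apache.kafka:kafka-clients:2.0.0",
        )

    A runner can compare such coordinates across environments, which is
    impossible with an opaque docker image layer.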


         > > In this case, the runner would (as requested by its
         > > configuration) be free to merge environments it deemed
         > > compatible, including swapping out beam-java-X for
         > > beam-java-embedded if it considers itself compatible with
         > > the dependency list.
         >
         > Could you explain how that would work in practice?

        Say one has a pipeline with environments

        A: beam-java-sdk-2.12-docker
        B: beam-java-sdk-2.12-docker + dep1
        C: beam-java-sdk-2.12-docker + dep2
        D: beam-java-sdk-2.12-docker + dep3

        A runner could (conceivably) be intelligent enough to know that
        dep1 and dep2 are indeed compatible, and run A, B, and C in a
        single beam-java-sdk-2.12-docker + dep1 + dep2 environment (with
        the corresponding fusion and lower overhead benefits). If a
        certain pipeline option is set, it might further note that dep1
        and dep2 are compatible with its own workers, which are built
        against sdk-2.12, and choose to run these in an embedded + dep1
        + dep2 environment.
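
        A toy version of that merging decision (purely illustrative;
        Environment and deps_compatible are hand-waved):

            def try_merge(env_a, env_b):
                # Merging only makes sense on the same base SDK image.
                if env_a.base_image != env_b.base_image:
                    return None
                deps = env_a.deps | env_b.deps
                # deps_compatible would check, e.g., that no two deps
                # pull in conflicting versions of a common library.
                if not deps_compatible(deps):
                    return None
                return Environment(env_a.base_image, deps)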


    We have been talking about the expansion service and cross-language
    transforms a lot lately, but I believe it will initially come at the
    cost of poor fusion of transforms: "merging" compatible environments
    is a difficult problem because it brings up many of the classic
    dependency management issues (e.g. diamond dependency conflicts,
    where two transforms transitively require incompatible versions of
    the same library).

    1: https://github.com/apache/beam/blob/516cdb6401d9fb7adb004de472771fb1fb3a92af/model/job-management/src/main/proto/beam_artifact_api.proto#L56
