On 8/17/21 10:40 PM, Luke Cwik wrote:


On Tue, Aug 17, 2021 at 1:28 PM Chamikara Jayalath <chamik...@google.com <mailto:chamik...@google.com>> wrote:



    On Tue, Aug 17, 2021 at 1:01 PM Luke Cwik <lc...@google.com
    <mailto:lc...@google.com>> wrote:

        I see the language differences but still feel as though there
        is a pretty common base that would work for object oriented
        languages and another for non-object oriented languages.


    For now, the only property I think that can be clearly moved to a
    common base is "class_name". I felt like adding a base just for
    that is overkill but any suggestions to the PR are welcome :)


        Authentication won't provide the right type of protection. For
        example if GCP hosted an expansion service, any GCP customer
        should be able to authenticate to use it but that wouldn't
        mean that GCP would want arbitrary code to be executed. We
        could have an allowlist of classes and methods that are able
        to be invoked via this pattern.


    Yeah, additional authentication mechanisms can be introduced to
    make this safer. I think the bottom line is that the ability to
    invoke Java transforms without introducing new Java code can be
    appealing to many non-Java cross-language transform users. Also,
    the proposed solution does not let users execute arbitrary Java
    code. They are simply invoking classes/methods that are already
    available in the expansion service in a controlled way. We can
    introduce additional authentication mechanisms, allowlists etc. to
    make this even safer.


        Finally, this solution does break the abstraction of "hey I
        want to execute BigQuery read with these parameters" since the
        current proposal is about how to construct such a transform
        via some method calls. I believe this will expose more sharp
        edges around pipeline author and expansion service versioning
        issues and places the onus onto the pipeline author or
        expansion service to not break anything.


    It does support instantiating a transform using constructor
    parameters or a constructor method with parameters, and builder
    methods. For example, BigQueryIO.readTableRows().from(String
    tableSpec), where  "BigQueryIO" is the class name, "readTableRows"
    is the constructor method, and "from(String tableSpec)" is a
    builder method. Did I miss any common pattern in addition to this ?


The issue isn't that you support building a BigQueryIO transform for reading, it is that the way that the proto defines the XLang transform is coupled directly to the code in how the transform is built. For example if someone wanted to create a C++ or Python  version of the expansion service then that someone would need to translate the Java code directly with all of its methods. The alternative is that every XLang transform has a specification and there is a specification to transform adapter that sits between the specification and the implementation of how that transform is constructed. This level of indirection provides a lot of value related to renames of fields/methods, versioning, implementations in other languages and security.
This is a good point. Sounds like we would like to verify client-server API compatibility, would a client's Beam version sent in the request solve that? The server could then verify that it has the same version (which is the most strict condition, but mostly safe one). It would be nice if this information could be used to route the request to a specific instance in the case of the hosted expansion service (multiple versions hosted behind the same hostname). I'm not that familiar with gRPC, but could this be added to the request's metadata somehow? And could that information be used for the reverse-proxying?


    Thanks,
    Cham



        On Tue, Aug 17, 2021 at 12:38 PM Chamikara Jayalath
        <chamik...@google.com <mailto:chamik...@google.com>> wrote:



            On Tue, Aug 17, 2021 at 11:52 AM Luke Cwik
            <lc...@google.com <mailto:lc...@google.com>> wrote:

                Thanks, I was able to finally take a look.

                I totally agree that this would be applicable to any
                language so replacing Java specific idioms with
                general language concepts but I think the risk is that
                no hosted expansion service would want to have support
                for unchecked call this method with this parameter
                since it is too large a security risk. Code/features
                like this is a common reason for CVE's being created.


            Actually the discussion regarding generalization of the
            payload for all languages evolved a bit [1] in the doc.

            I think the invocation patterns of different languages are
            different enough to warrant different URNs
            (PayloadTypeUrns) and payload types for instantiating
            transforms implemented in different languages. For
            example, Java requires a class name and builder methods.
            Parameters of the constructor have to be ordered.
            Python usually uses keyword arguments where the ordering
            doesn't matter. Go meanwhile does not have a concept of
            classes.

            Regarding vulnerabilities, I think one solution might be
            to introduce some sort of an authentication mechanism for
            ExpansionServices so that expansion requests can be
            properly authenticated. Currently we only use expansion
            services that are local processes so I think this can be
            left out of this proposal but this is something we should
            add to properly support remote expansion services.

            Thanks,
            Cham

            [1]
            
https://docs.google.com/document/d/1ECXSWicE31K-vSxdb4qL6UcmovOAWvE-ZHFT3NTM654/edit?disco=AAAANjeR9CU
            
<https://docs.google.com/document/d/1ECXSWicE31K-vSxdb4qL6UcmovOAWvE-ZHFT3NTM654/edit?disco=AAAANjeR9CU>


                On Tue, Aug 17, 2021 at 11:03 AM Chamikara Jayalath
                <chamik...@google.com <mailto:chamik...@google.com>>
                wrote:

                    Thanks for all the comments in the doc.

                    I created [1] for tracking and opened up a pull
                    request for proto and Java updates:
                    https://github.com/apache/beam/pull/15343
                    <https://github.com/apache/beam/pull/15343>

                    Thanks,
                    Cham

                    [1]
                    https://issues.apache.org/jira/browse/BEAM-12769
                    <https://issues.apache.org/jira/browse/BEAM-12769>

                    On Tue, Jul 27, 2021 at 6:28 PM Robert Bradshaw
                    <rober...@google.com <mailto:rober...@google.com>>
                    wrote:

                        On Tue, Jul 27, 2021 at 1:31 PM Chamikara
                        Jayalath <chamik...@google.com
                        <mailto:chamik...@google.com>> wrote:
                        >
                        > On Tue, Jul 27, 2021 at 11:03 AM Andrew
                        Pilloud <apill...@google.com
                        <mailto:apill...@google.com>> wrote:
                        >>
                        >> Hi Cham,
                        >>
                        >> Are you aware of the SchemaIO and
                        SchemaIOProvider interfaces (and their
                        design)? One of SchemaIO's goal is putting a
                        generic interface on IOs so users don't have
                        to construct wrappers for cross language use.
                        It looks like this new interface can probably
                        construct a Java SchemaIO, so it sounds
                        reasonable to me. (This might be something
                        worth testing when you implement it.)
                        >>
                        >> We are starting to add additional
                        functionality (support for automatic
                        optimizations, such as filter and project
                        push-down). I'm not sure how this is going to
                        work cross language yet, but we will probably
                        end up adding metadata needed to reconstruct
                        the transform to the portability proto.
                        >
                        > Went through it a bit and I think the two
                        designs are complementary. Schema aware IO
                        will allow some I/O transform authors to allow
                        easily accessing transforms from a remote SDK
                        using a SQL query while the current proposal
                        makes defining/using Java transforms easier
                        for non-Java programmers. I think both
                        proposals will help reduce the barrier to
                        entry for cross-language and will help make
                        more Java transforms available to other SDKs.

                        SQL benefits from being able to declare an IO
                        in textual form.
                        Cross-language seeks to establish a standard
                        to describe an IO in a
                        language-agnostic form. At their core is the
                        desire to be able to
                        instantiate an IO based on a name (which is
                        likely linked to an
                        implementation via a registrar) and a set of
                        named parameters of
                        "basic" type. I would hope that any IO that
                        offers a generic SchemaIO
                        interface will be trivially wrappable as an
                        external transform.

                        I do agree, however, that this external
                        transform is more general than
                        just IOs and transforms accepting/providing
                        Row types.

Reply via email to