On 8/17/21 10:40 PM, Luke Cwik wrote:
On Tue, Aug 17, 2021 at 1:28 PM Chamikara Jayalath
<chamik...@google.com <mailto:chamik...@google.com>> wrote:
On Tue, Aug 17, 2021 at 1:01 PM Luke Cwik <lc...@google.com
<mailto:lc...@google.com>> wrote:
I see the language differences but still feel as though there
is a pretty common base that would work for object oriented
languages and another for non-object oriented languages.
For now, the only property I think that can be clearly moved to a
common base is "class_name". I felt like adding a base just for
that is overkill but any suggestions to the PR are welcome :)
Authentication won't provide the right type of protection. For
example if GCP hosted an expansion service, any GCP customer
should be able to authenticate to use it but that wouldn't
mean that GCP would want arbitrary code to be executed. We
could have an allowlist of classes and methods that are able
to be invoked via this pattern.
Yeah, additional authentication mechanisms can be introduced to
make this safer. I think the bottom line is that the ability to
invoke Java transforms without introducing new Java code can be
appealing to many non-Java cross-language transform users. Also,
the proposed solution does not let users execute arbitrary Java
code. They are simply invoking classes/methods that are already
available in the expansion service in a controlled way. We can
introduce additional authentication mechanisms, allowlists etc. to
make this even safer.
Finally, this solution does break the abstraction of "hey I
want to execute BigQuery read with these parameters" since the
current proposal is about how to construct such a transform
via some method calls. I believe this will expose more sharp
edges around pipeline author and expansion service versioning
issues and places the onus onto the pipeline author or
expansion service to not break anything.
It does support instantiating a transform using constructor
parameters or a constructor method with parameters, and builder
methods. For example, BigQueryIO.readTableRows().from(String
tableSpec), where "BigQueryIO" is the class name, "readTableRows"
is the constructor method, and "from(String tableSpec)" is a
builder method. Did I miss any common pattern in addition to this ?
The issue isn't that you support building a BigQueryIO transform for
reading, it is that the way that the proto defines the XLang
transform is coupled directly to the code in how the transform is
built. For example if someone wanted to create a C++ or Python
version of the expansion service then that someone would need to
translate the Java code directly with all of its methods. The
alternative is that every XLang transform has a specification and
there is a specification to transform adapter that sits between the
specification and the implementation of how that transform is
constructed. This level of indirection provides a lot of value related
to renames of fields/methods, versioning, implementations in other
languages and security.
This is a good point. Sounds like we would like to verify client-server
API compatibility, would a client's Beam version sent in the request
solve that? The server could then verify that it has the same version
(which is the most strict condition, but mostly safe one). It would be
nice if this information could be used to route the request to a
specific instance in the case of the hosted expansion service (multiple
versions hosted behind the same hostname). I'm not that familiar with
gRPC, but could this be added to the request's metadata somehow? And
could that information be used for the reverse-proxying?
Thanks,
Cham
On Tue, Aug 17, 2021 at 12:38 PM Chamikara Jayalath
<chamik...@google.com <mailto:chamik...@google.com>> wrote:
On Tue, Aug 17, 2021 at 11:52 AM Luke Cwik
<lc...@google.com <mailto:lc...@google.com>> wrote:
Thanks, I was able to finally take a look.
I totally agree that this would be applicable to any
language so replacing Java specific idioms with
general language concepts but I think the risk is that
no hosted expansion service would want to have support
for unchecked call this method with this parameter
since it is too large a security risk. Code/features
like this is a common reason for CVE's being created.
Actually the discussion regarding generalization of the
payload for all languages evolved a bit [1] in the doc.
I think the invocation patterns of different languages are
different enough to warrant different URNs
(PayloadTypeUrns) and payload types for instantiating
transforms implemented in different languages. For
example, Java requires a class name and builder methods.
Parameters of the constructor have to be ordered.
Python usually uses keyword arguments where the ordering
doesn't matter. Go meanwhile does not have a concept of
classes.
Regarding vulnerabilities, I think one solution might be
to introduce some sort of an authentication mechanism for
ExpansionServices so that expansion requests can be
properly authenticated. Currently we only use expansion
services that are local processes so I think this can be
left out of this proposal but this is something we should
add to properly support remote expansion services.
Thanks,
Cham
[1]
https://docs.google.com/document/d/1ECXSWicE31K-vSxdb4qL6UcmovOAWvE-ZHFT3NTM654/edit?disco=AAAANjeR9CU
<https://docs.google.com/document/d/1ECXSWicE31K-vSxdb4qL6UcmovOAWvE-ZHFT3NTM654/edit?disco=AAAANjeR9CU>
On Tue, Aug 17, 2021 at 11:03 AM Chamikara Jayalath
<chamik...@google.com <mailto:chamik...@google.com>>
wrote:
Thanks for all the comments in the doc.
I created [1] for tracking and opened up a pull
request for proto and Java updates:
https://github.com/apache/beam/pull/15343
<https://github.com/apache/beam/pull/15343>
Thanks,
Cham
[1]
https://issues.apache.org/jira/browse/BEAM-12769
<https://issues.apache.org/jira/browse/BEAM-12769>
On Tue, Jul 27, 2021 at 6:28 PM Robert Bradshaw
<rober...@google.com <mailto:rober...@google.com>>
wrote:
On Tue, Jul 27, 2021 at 1:31 PM Chamikara
Jayalath <chamik...@google.com
<mailto:chamik...@google.com>> wrote:
>
> On Tue, Jul 27, 2021 at 11:03 AM Andrew
Pilloud <apill...@google.com
<mailto:apill...@google.com>> wrote:
>>
>> Hi Cham,
>>
>> Are you aware of the SchemaIO and
SchemaIOProvider interfaces (and their
design)? One of SchemaIO's goal is putting a
generic interface on IOs so users don't have
to construct wrappers for cross language use.
It looks like this new interface can probably
construct a Java SchemaIO, so it sounds
reasonable to me. (This might be something
worth testing when you implement it.)
>>
>> We are starting to add additional
functionality (support for automatic
optimizations, such as filter and project
push-down). I'm not sure how this is going to
work cross language yet, but we will probably
end up adding metadata needed to reconstruct
the transform to the portability proto.
>
> Went through it a bit and I think the two
designs are complementary. Schema aware IO
will allow some I/O transform authors to allow
easily accessing transforms from a remote SDK
using a SQL query while the current proposal
makes defining/using Java transforms easier
for non-Java programmers. I think both
proposals will help reduce the barrier to
entry for cross-language and will help make
more Java transforms available to other SDKs.
SQL benefits from being able to declare an IO
in textual form.
Cross-language seeks to establish a standard
to describe an IO in a
language-agnostic form. At their core is the
desire to be able to
instantiate an IO based on a name (which is
likely linked to an
implementation via a registrar) and a set of
named parameters of
"basic" type. I would hope that any IO that
offers a generic SchemaIO
interface will be trivially wrappable as an
external transform.
I do agree, however, that this external
transform is more general than
just IOs and transforms accepting/providing
Row types.