I think it depends on how we define "the core" part of the SDK. If we define
the core as only the (abstract) data types which describe the Beam pipeline
model, then it would be more sensible to put the external transform into a
separate extension module (option 4). Otherwise, option 1 makes sense.
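
To make the extension-module variant a bit more concrete, here is a rough
sketch of how the core could keep only an abstract type while a gRPC-backed
implementation lives in a separate module and is loaded reflectively when an
external transform is actually used (in the spirit of options 3 and 4 quoted
below). All names here (`Expander`, `ExpanderLoader`, `FakeExpander`,
`org.example.extensions.GrpcExpander`) are hypothetical and are not Beam APIs;
this is only an illustration under those assumptions, not a proposal for the
actual implementation.

```java
// Hypothetical sketch: none of these types exist in Beam.

// The only piece the "core" would need to know about: an abstract type.
interface Expander {
    String expand(String urn);
}

final class ExpanderLoader {
    // Assumed class name of a gRPC-backed implementation in an optional module.
    static final String DEFAULT_IMPL = "org.example.extensions.GrpcExpander";

    // Loads the implementation by name, so the core has no compile-time
    // dependency on the module (or on gRPC) that provides it.
    static Expander load(String implClassName) {
        try {
            Class<?> cls = Class.forName(implClassName);
            return (Expander) cls.getDeclaredConstructor().newInstance();
        } catch (ClassNotFoundException e) {
            // The optional module is missing: fail with an actionable message.
            throw new IllegalStateException(
                "External transform used, but " + implClassName
                    + " is not on the classpath; add the extension module.", e);
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalStateException(
                "Could not instantiate " + implClassName, e);
        }
    }
}

// Stand-in for the extension module's implementation, used here only so the
// sketch can be exercised without gRPC.
final class FakeExpander implements Expander {
    public FakeExpander() {}

    @Override
    public String expand(String urn) {
        return "expanded:" + urn;
    }
}
```

With this shape, users who never touch External would not pull in the gRPC
dependencies, and users who do would get a clear error if the extension module
is missing from their classpath.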

On Wed, Jul 24, 2019 at 11:56 AM Chamikara Jayalath <chamik...@google.com>
wrote:

> The idea of 'ExternalTransform' is to allow users to use transforms in SDK
> X from SDK Y. I think this should be a core part of each SDK and
> corresponding external transforms ([a] for Java, [b] for Python) should be
> released with each SDK. This will also allow us to add core external
> transforms to some of the critical transforms that are not available in
> certain SDKs. So I prefer option (1).
>
> Rebo, I didn't realize there's an external transform in the Go SDK. Looking
> at it, it seems more like an interface for native transforms implemented in
> each runner, not for cross-language use cases. Is that correct? Maybe we can
> reuse it for the latter as well.
>
> Thanks,
> Cham
>
> [a]
> https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/External.java
> [b]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/external.py
>
> On Wed, Jul 24, 2019 at 10:25 AM Robert Burke <rob...@frantil.com> wrote:
>
>> Ideas inline.
>>
>> On Wed, Jul 24, 2019, 9:56 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>
>>> After Beam Summit EU I was curious about the External transform. I was
>>> interested in the scenario of using it to call Python code in the
>>> middle of a Java pipeline. This is a potentially useful scenario, for
>>> example to evaluate models from Python ML frameworks in Java
>>> pipelines. In my example I wrote a transform to classify elements in a
>>> simple Python ParDo and tried to connect it via the Java External
>>> transform.
>>>
>>> I found that the ExternalTransform code was added to
>>> `runners/core-construction-java` as part of BEAM-6747 [1]. However,
>>> this code is not currently exposed as part of the Beam Java SDK, so
>>> end users won’t be able to find it easily. I found this odd and
>>> thought it would be as simple as moving it into the Java SDK, and
>>> voilà!
>>>
>>> But of course it is not that easy, because this transform calls
>>> the Expansion service via gRPC, and the Java SDK does not have (and
>>> probably should not have) gRPC in its dependencies.
>>> So my second reflex was to add it to the Java SDK and translate it as a
>>> generic expansion in all the runners, but this may not make sense because
>>> the External transform is not part of the runner translation; it is
>>> part of the pipeline construction process (as Max pointed out to me
>>> in a Slack discussion).
>>>
>>> So the question is: How do you think this should be exposed to the end
>>> users?
>>>
>>> 1. Should we add gRPC with all its deps to the Java SDK core? (This is
>>> of course not nice, because we would leak our vendored gRPC and
>>> friends into users' classpath.)
>>>
>> If there's separation between the SDK and the harness, then this makes
>> sense. Otherwise, the portable harness depends on gRPC at present, doesn't
>> it? Presently the Go SDK kicks off the harness, and so carries the gRPC
>> dependency (though that's separable if necessary).
>>
>>> 2. Should we load the classes dynamically, only at runtime and only if
>>> the transform is used, to avoid the big extra compile dependency (and add
>>> runners/core-construction-java as a runtime dependency)?
>>> 3. Should we create a ‘shim’ module to hide the gRPC dependency and
>>> load the gRPC classes dynamically in it when the External transform is
>>> part of the pipeline?
>>> 4. Should we pack it as an extension (with the same issue of needing
>>> to leak the dependencies, but with less impact for users who do not
>>> use External) ?
>>> 5. Other?
>>>
>>> The ‘purist’ me thinks we should have External in sdks/java/core but
>>> maybe it is better not to. Any other opinions or ideas?
>>>
>>
>> The Go SDK supports External in its core transform set. However, callers
>> are able to populate the data field however they need to, whether that's
>> some "known" configuration object or something sourced from another
>> service (e.g. the expansion service). The important part on the
>> other side is that the runner knows what to do with it.
>>
>> The non-portable pubsubio in the Go SDK is an example [1] using External
>> currently. The Dataflow runner recognizes it, and makes the substitution.
>> Eventually once the SDK supports SDF that can generate unbounded
>> PCollections, this will likely be replaced with that kind of
>> implementation, and the existing "External" version will be moved to
>> part of the Go SDK's Dataflow runner package.
>>
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/io/pubsubio/pubsubio.go#L65
>>
>>>
>>> Thanks,
>>> Ismaël
>>>
>>> [1] https://issues.apache.org/jira/browse/BEAM-6747
>>>
>>
