After Beam Summit EU I was curious about the External transform. I was
interested in the scenario of using it to call Python code in the
middle of a Java pipeline. This is potentially useful, for example, to
evaluate models from Python ML frameworks in Java pipelines. In my
example I wrote a transform that classifies elements in a simple
Python ParDo and tried to connect it via the Java External transform.

I found that the ExternalTransform code was added to
`runners/core-construction-java` as part of BEAM-6747 [1]. However,
this code is not currently exposed as part of the Beam Java SDK, so
end users won’t be able to find it easily. I found this odd and
thought it would be as simple as moving it into the Java SDK and
voilà!

But of course it could not be that easy, because this transform calls
the Expansion service via gRPC, and the Java SDK does not have (and
probably should not have) gRPC in its dependencies.
So my second reflex was to add it to the Java SDK and translate it as
a generic expansion in all the runners, but this may not make sense
because the External transform is not part of the runner translation:
it is part of the pipeline construction process (as Max pointed out to
me in a Slack discussion).

So the question is: How do you think this should be exposed to the end users?

1. Should we add gRPC with all its deps to the Java SDK core? (This is
of course not nice, because we would leak our vendored gRPC and
friends into the users' classpath.)
2. Should we load the classes dynamically, only at runtime and only if
the transform is used, to avoid the big extra compile dependency (and
add runners/core-construction-java as a runtime dependency)?
3. Should we create a ‘shim’ module to hide the gRPC dependency and
load the gRPC classes dynamically in it when the External transform is
part of the pipeline?
4. Should we package it as an extension (with the same issue of
needing to leak the dependencies, but with less impact for users who
do not use External)?
5. Other?
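
To make options 2 and 3 a bit more concrete, here is a rough sketch of
what loading the gRPC-backed code reflectively could look like. This
is not existing Beam code: `ExternalShim` and the class name
`org.example.expansion.GrpcExpansionClient` are made up for
illustration, and the real thing would need a proper interface instead
of returning `Object`.

```java
/** Hypothetical shim: the gRPC-backed class is only resolved when requested. */
final class ExternalShim {
  // Assumed name for illustration only; this is not a real Beam class.
  static final String DEFAULT_IMPL = "org.example.expansion.GrpcExpansionClient";

  private ExternalShim() {}

  /** Returns true if the named class can be found on the runtime classpath. */
  static boolean isAvailable(String implClassName) {
    try {
      // initialize=false: only check presence, do not run static initializers.
      Class.forName(implClassName, false, ExternalShim.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  /** Instantiates the named class via its no-arg constructor, or fails loudly. */
  static Object load(String implClassName) {
    try {
      Class<?> clazz = Class.forName(implClassName);
      return clazz.getDeclaredConstructor().newInstance();
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(
          "External transform support is not on the classpath: " + implClassName, e);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Could not instantiate " + implClassName, e);
    }
  }
}
```

With something like this, the core module would compile without gRPC,
and users who never use External would never trigger the lookup. The
cost is that a missing dependency only shows up at pipeline
construction time instead of at compile time.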

The ‘purist’ me thinks we should have External in sdks/java/core but
maybe it is better not to. Any other opinions or ideas?

Thanks,
Ismaël

[1] https://issues.apache.org/jira/browse/BEAM-6747
