Hi everyone,
We have previously merged support for configuring transforms across
languages. Please see Cham's summary on the discussion [1]. There is
also a design document [2].
Subsequently, we've added wrappers for cross-language transforms to the
Python SDK, i.e. GenerateSequence, ReadFromKafka, and there is a pending
PR [1] for WriteToKafka. All of them utilize Java transforms via
cross-language configuration.
That is all pretty exciting :)
We still have some issues to solve, one being how to stage artifact from
a foreign environment. When we run external transforms which are part of
Beam's core (e.g. GenerateSequence), we have them available in the SDK
Harness. However, when they are not (e.g. KafkaIO) we need to stage the
necessary files.
For my PR [3] I've naively added ":beam-sdks-java-io-kafka" to the SDK
Harness which caused dependency problems [4]. Those could be resolved
but the bigger question is how to stage artifacts for external
transforms programmatically?
Heejong has solved this by adding a "--jar_package" option to the Python
SDK to stage Java files [5]. I think that is a better solution than
adding required Jars to the SDK Harness directly, but it is not very
convenient for users.
I've discussed this today with Thomas and we both figured that the
expansion service needs to provide a list of required Jars with the
ExpansionResponse it provides. It's not entirely clear, how we determine
which artifacts are necessary for an external transform. We could just
dump the entire classpath like we do in PipelineResources for Java
pipelines. This provides many unneeded classes but would work.
Do you think it makes sense for the expansion service to provide the
artifacts? Perhaps you have a better idea how to resolve the staging
problem in cross-language pipelines?
Thanks,
Max
[1]
https://lists.apache.org/thread.html/b99ba8527422e31ec7bb7ad9dc3a6583551ea392ebdc5527b5fb4a67@%3Cdev.beam.apache.org%3E
[2] https://s.apache.org/beam-cross-language-io
[3] https://github.com/apache/beam/pull/8322#discussion_r276336748
[4] Dependency graph for beam-runners-direct-java:
beam-runners-direct-java -> sdks-java-harness -> beam-sdks-java-io-kafka
-> beam-runners-direct-java ... the cycle continues
Beam-runners-direct-java depends on sdks-java-harness due
to the infamous Universal Local Runner. Beam-sdks-java-io-kafka depends
on beam-runners-direct-java for running tests.
[5] https://github.com/apache/beam/pull/8340