Re: Testing Multilanguage Pipelines?

Robert Bradshaw via dev Wed, 28 Dec 2022 10:42:53 -0800

On Wed, Dec 28, 2022 at 10:09 AM Byron Ellis <[email protected]> wrote:
>
> On Wed, Dec 28, 2022 at 9:49 AM Robert Bradshaw <[email protected]> wrote:
>>
>> On Wed, Dec 28, 2022 at 4:56 AM Danny McCormick via dev
>> <[email protected]> wrote:
>> >
>> > > Given the increasing importance of multi language pipelines, it does 
>> > > seem that we should expand the capabilities of the DirectRunner or just 
>> > > go all in on FlinkRunner for testing and local / small scale development
>> >
>> > +1 - annecdotally I've found local testing of multi-language pipelines to 
>> > be tricky, and have had multiple conversations with others who have run 
>> > into similar challenges in multiple contexts (both users and people 
>> > working on the project).
>>
>> I generally do all my testing against the Python runner which works
>> well. This is, of course, more natural for Python pipelines using
>> other languages, but when I was working on typescript which uses
>> cross-language even more heavily I just made it auto-start the python
>> runner just like the expansion services are auto-started which works
>> quite well. (The auto-started runner is just a plain-old portable
>> runner speaking the runner API, so no additional support is required
>> on the source side once it's started. And if you're already trying to
>> use dataframes and/or ML, you need to have Python available anyway.)
>>
>> We could consider bundling it as a docker image to reduce the required
>> dependency set, but we'd have to solve the docker-in-docker issue to
>> do that.
>>
>> I really think it's important to make cross-language a first-class
>> citizen--the end use should not care most of the time whether the
>> pipelines they use are native or not.
>
>
> Thanks! That's helpful. In this case getting the Python runner to auto-start 
> sounds like the most straightforward option for testing. After all it's 
> explicitly to provide Python initiated from Java so Python is already going 
> to be around and running (and in fact the test auto-starts the Python 
> expansion service already to get the graph in the first place) and the deps 
> are already going to be there.

Yep.

> I'm personally on the fence about Docker in these sorts of situations. Yes, 
> it makes life easier for the most part but gets complicated quickly. It's 
> also not an option for everyone.

For sure. I think it'd be good to have various alternative packaging
of expansion services as different people will have different setups
(e.g. a Crostini Go developer is more likely to have docker than java,
but it's probably just the opposite for a java developer on windows).
This is what I did for the yaml thing. Note that nominally docker is
required for running a cross-language pipeline, so that makes it a
more natural option there. (Technically, at least for development, you
can have the host SDK process vend itself as a worker in LOOPBACK
mode, and if you pass the directEmbedDockerPython=true option to the
portable python runner it will inline the Python operations rather
than firing up a docker worker for those (assuming, of course, the
versions match.)

> I'll give things a shot and report back (if you have an example of 
> auto-starting the Python runner that'd be cool too---if I get inspired I 
> might try to add that to the Python extensions in Java since right now they 
> don't actually appear to be exercising the runner itself based on the TODOs)

In typescript the runner is started up as

PythonService.forModule("apache_beam.runners.portability.local_job_service_main",
["--port", "{{PORT}}"])

which is very similar to how the expansion service is started up

PythonService.forModule("apache_beam.runners.portability.expansion_service_main",
["--fully_qualified_name_glob=*", "--port", "{{PORT}}"])

Here {{PORT}} is auto-populated and can be retrieved to instantiate
the GRPC connection (or, in your case, passed to the portable runner
as the endpoint).

Java is very similar, one does

    PythonService service =
        new 
PythonService("apache_beam.runners.portability.expansion_service_main",
...);
    AutoCloseable running = service.start();
    ...

which should be easily adaptable to starting up a runner. On first use
this service automatically creates a virtual environment with Beam
(and other dependencies) installed. (I don't know what the analogue is
for Go, but it shouldn't be that different...)

The one difficulty with auto-started service is that the release
artifacts are not necessarily available for a dev repo the same way
they are with a released version. IIRC, we fall back to the previous
release in that case. To compensate, and have faster iteration/easier
testing, one can set an environment variable BEAM_SERVICE_OVERRIDES
where one specifies an existing venv, jar, or address to use for a
specific service, see
https://github.com/apache/beam/blob/release-2.43.0/sdks/typescript/src/apache_beam/utils/service.ts#L432
. This works for Python and typescript; I don't remember if I
implemented it for Java and I don't think it's yet in Go.

Hopefully this is enough pointers to get started. It'd be great to get
Java up to snuff.

References:
https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/runners/universal.ts#L34
https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/transforms/python.ts#L60
https://github.com/apache/beam/blob/release-2.43.0/sdks/java/extensions/python/src/main/java/org/apache/beam/sdk/extensions/python/PythonExternalTransform.java#L466

>> > On Wed, Dec 28, 2022 at 7:50 AM Sachin Agarwal via dev 
>> > <[email protected]> wrote:
>> >>
>> >> Given the increasing importance of multi language pipelines, it does seem 
>> >> that we should expand the capabilities of the DirectRunner or just go all 
>> >> in on FlinkRunner for testing and local / small scale development
>> >>
>> >> On Wed, Dec 28, 2022 at 12:47 AM Robert Burke <[email protected]> wrote:
>> >>>
>> >>> Probably either on Flink, or the Python Portable runner at this juncture.
>> >>>
>> >>> On Tue, Dec 27, 2022, 8:40 PM Byron Ellis via dev <[email protected]> 
>> >>> wrote:
>> >>>>
>> >>>> Hi all,
>> >>>>
>> >>>> I spent some more time adding things to my dbt-for-Beam clone 
>> >>>> (https://github.com/apache/beam/pull/24670) and actually made a fair 
>> >>>> amount of progress, including starting to add in the profile support so 
>> >>>> I can start to run it against real workloads (though at the moment only 
>> >>>> the "test" connector is properly configured). More interestingly, 
>> >>>> though, is adding in support for Python Dataframe external 
>> >>>> transforms... which expands properly, but then (unsurprisingly) hangs 
>> >>>> if you try to actually run the pipeline with Java's TestPipeline.
>> >>>>
>> >>>> I was wondering how people go about testing Java/Python hybrid 
>> >>>> pipelines locally? The Java<->Python tests don't seem to actually 
>> >>>> execute a pipeline, but I was hoping that maybe the direct runner could 
>> >>>> be set up properly to do that?
>> >>>>
>> >>>> Best,
>> >>>> B

Re: Testing Multilanguage Pipelines?

Reply via email to