On Wed, Dec 28, 2022 at 10:09 AM Byron Ellis <[email protected]> wrote:
>
> On Wed, Dec 28, 2022 at 9:49 AM Robert Bradshaw <[email protected]> wrote:
>>
>> On Wed, Dec 28, 2022 at 4:56 AM Danny McCormick via dev
>> <[email protected]> wrote:
>> >
>> > > Given the increasing importance of multi language pipelines, it does
>> > > seem that we should expand the capabilities of the DirectRunner or just
>> > > go all in on FlinkRunner for testing and local / small scale development
>> >
>> > +1 - annecdotally I've found local testing of multi-language pipelines to
>> > be tricky, and have had multiple conversations with others who have run
>> > into similar challenges in multiple contexts (both users and people
>> > working on the project).
>>
>> I generally do all my testing against the Python runner which works
>> well. This is, of course, more natural for Python pipelines using
>> other languages, but when I was working on typescript which uses
>> cross-language even more heavily I just made it auto-start the python
>> runner just like the expansion services are auto-started which works
>> quite well. (The auto-started runner is just a plain-old portable
>> runner speaking the runner API, so no additional support is required
>> on the source side once it's started. And if you're already trying to
>> use dataframes and/or ML, you need to have Python available anyway.)
>>
>> We could consider bundling it as a docker image to reduce the required
>> dependency set, but we'd have to solve the docker-in-docker issue to
>> do that.
>>
>> I really think it's important to make cross-language a first-class
>> citizen--the end use should not care most of the time whether the
>> pipelines they use are native or not.
>
>
> Thanks! That's helpful. In this case getting the Python runner to auto-start
> sounds like the most straightforward option for testing. After all it's
> explicitly to provide Python initiated from Java so Python is already going
> to be around and running (and in fact the test auto-starts the Python
> expansion service already to get the graph in the first place) and the deps
> are already going to be there.
Yep.
> I'm personally on the fence about Docker in these sorts of situations. Yes,
> it makes life easier for the most part but gets complicated quickly. It's
> also not an option for everyone.
For sure. I think it'd be good to have various alternative packaging
of expansion services as different people will have different setups
(e.g. a Crostini Go developer is more likely to have docker than java,
but it's probably just the opposite for a java developer on windows).
This is what I did for the yaml thing. Note that nominally docker is
required for running a cross-language pipeline, so that makes it a
more natural option there. (Technically, at least for development, you
can have the host SDK process vend itself as a worker in LOOPBACK
mode, and if you pass the directEmbedDockerPython=true option to the
portable python runner it will inline the Python operations rather
than firing up a docker worker for those (assuming, of course, the
versions match.)
> I'll give things a shot and report back (if you have an example of
> auto-starting the Python runner that'd be cool too---if I get inspired I
> might try to add that to the Python extensions in Java since right now they
> don't actually appear to be exercising the runner itself based on the TODOs)
In typescript the runner is started up as
PythonService.forModule("apache_beam.runners.portability.local_job_service_main",
["--port", "{{PORT}}"])
which is very similar to how the expansion service is started up
PythonService.forModule("apache_beam.runners.portability.expansion_service_main",
["--fully_qualified_name_glob=*", "--port", "{{PORT}}"])
Here {{PORT}} is auto-populated and can be retrieved to instantiate
the GRPC connection (or, in your case, passed to the portable runner
as the endpoint).
Java is very similar, one does
PythonService service =
new
PythonService("apache_beam.runners.portability.expansion_service_main",
...);
AutoCloseable running = service.start();
...
which should be easily adaptable to starting up a runner. On first use
this service automatically creates a virtual environment with Beam
(and other dependencies) installed. (I don't know what the analogue is
for Go, but it shouldn't be that different...)
The one difficulty with auto-started service is that the release
artifacts are not necessarily available for a dev repo the same way
they are with a released version. IIRC, we fall back to the previous
release in that case. To compensate, and have faster iteration/easier
testing, one can set an environment variable BEAM_SERVICE_OVERRIDES
where one specifies an existing venv, jar, or address to use for a
specific service, see
https://github.com/apache/beam/blob/release-2.43.0/sdks/typescript/src/apache_beam/utils/service.ts#L432
. This works for Python and typescript; I don't remember if I
implemented it for Java and I don't think it's yet in Go.
Hopefully this is enough pointers to get started. It'd be great to get
Java up to snuff.
References:
https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/runners/universal.ts#L34
https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/transforms/python.ts#L60
https://github.com/apache/beam/blob/release-2.43.0/sdks/java/extensions/python/src/main/java/org/apache/beam/sdk/extensions/python/PythonExternalTransform.java#L466
>> > On Wed, Dec 28, 2022 at 7:50 AM Sachin Agarwal via dev
>> > <[email protected]> wrote:
>> >>
>> >> Given the increasing importance of multi language pipelines, it does seem
>> >> that we should expand the capabilities of the DirectRunner or just go all
>> >> in on FlinkRunner for testing and local / small scale development
>> >>
>> >> On Wed, Dec 28, 2022 at 12:47 AM Robert Burke <[email protected]> wrote:
>> >>>
>> >>> Probably either on Flink, or the Python Portable runner at this juncture.
>> >>>
>> >>> On Tue, Dec 27, 2022, 8:40 PM Byron Ellis via dev <[email protected]>
>> >>> wrote:
>> >>>>
>> >>>> Hi all,
>> >>>>
>> >>>> I spent some more time adding things to my dbt-for-Beam clone
>> >>>> (https://github.com/apache/beam/pull/24670) and actually made a fair
>> >>>> amount of progress, including starting to add in the profile support so
>> >>>> I can start to run it against real workloads (though at the moment only
>> >>>> the "test" connector is properly configured). More interestingly,
>> >>>> though, is adding in support for Python Dataframe external
>> >>>> transforms... which expands properly, but then (unsurprisingly) hangs
>> >>>> if you try to actually run the pipeline with Java's TestPipeline.
>> >>>>
>> >>>> I was wondering how people go about testing Java/Python hybrid
>> >>>> pipelines locally? The Java<->Python tests don't seem to actually
>> >>>> execute a pipeline, but I was hoping that maybe the direct runner could
>> >>>> be set up properly to do that?
>> >>>>
>> >>>> Best,
>> >>>> B