Thanks for the tips, folks! Took a bit of doing, but I got Java -> Python -> Java working without Docker being involved in the process (getting it working with Docker being involved wasn't so bad... though it didn't do what I wanted with respect to collecting results). Removing Docker appears to let me collect the results back on the Java side via Beam SQL's TestTable, which then lets me inspect the results for test validation purposes.

In case anyone else is feeling similarly foolish, here's what ended up working:
https://github.com/byronellis/beam/blob/structured-pipeline-definitions/sdks/java/extensions/spd/src/test/java/org/apache/beam/sdk/extensions/spd/StructuredPipelineExecutionTest.java

It ain't pretty, but it gets the job done.

Best,
B

On Wed, Dec 28, 2022 at 10:42 AM Robert Bradshaw wrote:

> On Wed, Dec 28, 2022 at 10:09 AM Byron Ellis wrote:
> >
> > On Wed, Dec 28, 2022 at 9:49 AM Robert Bradshaw wrote:
> >>
> >> On Wed, Dec 28, 2022 at 4:56 AM Danny McCormick via dev wrote:
> >> >
> >> > > Given the increasing importance of multi language pipelines, it does seem that we should expand the capabilities of the DirectRunner or just go all in on FlinkRunner for testing and local / small scale development
> >> >
> >> > +1 - anecdotally I've found local testing of multi-language pipelines to be tricky, and have had multiple conversations with others who have run into similar challenges in multiple contexts (both users and people working on the project).
> >>
> >> I generally do all my testing against the Python runner which works well. This is, of course, more natural for Python pipelines using other languages, but when I was working on typescript which uses cross-language even more heavily I just made it auto-start the python runner just like the expansion services are auto-started which works quite well. (The auto-started runner is just a plain-old portable runner speaking the runner API, so no additional support is required on the source side once it's started. And if you're already trying to use dataframes and/or ML, you need to have Python available anyway.)
> >>
> >> We could consider bundling it as a docker image to reduce the required dependency set, but we'd have to solve the docker-in-docker issue to do that.
> >>
> >> I really think it's important to make cross-language a first-class citizen--the end user should not care most of the time whether the pipelines they use are native or not.
> >
> >
> > Thanks! That's helpful. In this case getting the Python runner to auto-start sounds like the most straightforward option for testing. After all it's explicitly to provide Python initiated from Java so Python is already going to be around and running (and in fact the test auto-starts the Python expansion service already to get the graph in the first place) and the deps are already going to be there.
>
> Yep.
>
> > I'm personally on the fence about Docker in these sorts of situations. Yes, it makes life easier for the most part but gets complicated quickly. It's also not an option for everyone.
>
> For sure. I think it'd be good to have various alternative packaging of expansion services as different people will have different setups (e.g. a Crostini Go developer is more likely to have docker than java, but it's probably just the opposite for a java developer on windows). This is what I did for the yaml thing. Note that nominally docker is required for running a cross-language pipeline, so that makes it a more natural option there. (Technically, at least for development, you can have the host SDK process vend itself as a worker in LOOPBACK mode, and if you pass the directEmbedDockerPython=true option to the portable python runner it will inline the Python operations rather than firing up a docker worker for those (assuming, of course, the versions match).)
>
> > I'll give things a shot and report back (if you have an example of auto-starting the Python runner that'd be cool too---if I get inspired I might try to add that to the Python extensions in Java since right now they don't actually appear to be exercising the runner itself based on the TODOs)
>
> In typescript the runner is started up as
>
>     PythonService.forModule(
>         "apache_beam.runners.portability.local_job_service_main",
>         ["--port", "{{PORT}}"])
>
> which is very similar to how the expansion service is started up
>
>     PythonService.forModule(
>         "apache_beam.runners.portability.expansion_service_main",
>         ["--fully_qualified_name_glob=*", "--port", "{{PORT}}"])
>
> Here {{PORT}} is auto-populated and can be retrieved to instantiate the GRPC connection (or, in your case, passed to the portable runner as the endpoint).
>
> Java is very similar, one does
>
>     PythonService service =
>         new PythonService(
>             "apache_beam.runners.portability.expansion_service_main", ...);
>     AutoCloseable running = service.start();
>     ...
>
> which should be easily adaptable to starting up a runner. On first use this service automatically creates a virtual environment with Beam (and other dependencies) installed. (I don't know what the analogue is for Go, but it shouldn't be that different...)
>
> The one difficulty with auto-started service is that the release artifacts are not necessarily available for a dev repo the same way they are with a released version. IIRC, we fall back to the previous release in that case. To compensate, and have faster iteration/easier testing, one can set an environment variable BEAM_SERVICE_OVERRIDES where one specifies an existing venv, jar, or address to use for a specific service, see
> https://github.com/apache/beam/blob/release-2.43.0/sdks/typescript/src/apache_beam/utils/service.ts#L432
> This works for Python and typescript; I don't remember if I implemented it for Java and I don't think it's yet in Go.
>
> Hopefully this is enough pointers to get started. It'd be great to get Java up to snuff.
>
> References:
> https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/runners/universal.ts#L34
> https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/transforms/python.ts#L60
> https://github.com/apache/beam/blob/release-2.43.0/sdks/java/extensions/python/src/main/java/org/apache/beam/sdk/extensions/python/PythonExternalTransform.java#L466
>
> >> > On Wed, Dec 28, 2022 at 7:50 AM Sachin Agarwal via dev wrote:
> >> >>
> >> >> Given the increasing importance of multi language pipelines, it does seem that we should expand the capabilities of the DirectRunner or just go all in on FlinkRunner for testing and local / small scale development
> >> >>
> >> >> On Wed, Dec 28, 2022 at 12:47 AM Robert Burke wrote:
> >> >>>
> >> >>> Probably either on Flink, or the Python Portable runner at this juncture.
> >> >>>
> >> >>> On Tue, Dec 27, 2022, 8:40 PM Byron Ellis via dev wrote:
> >> >>>>
> >> >>>> Hi all,
> >> >>>>
> >> >>>> I spent some more time adding things to my dbt-for-Beam clone (https://github.com/apache/beam/pull/24670) and actually made a fair amount of progress, including starting to add in the profile support so I can start to run it against real workloads (though at the moment only the "test" connector is properly configured). More interestingly, though, is adding in support for Python Dataframe external transforms... which expands properly, but then (unsurprisingly) hangs if you try to actually run the pipeline with Java's TestPipeline.
> >> >>>>
> >> >>>> I was wondering how people go about testing Java/Python hybrid pipelines locally? The Java<->Python tests don't seem to actually execute a pipeline, but I was hoping that maybe the direct runner could be set up properly to do that?
> >> >>>>
> >> >>>> Best,
> >> >>>> B
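
For anyone skimming the quoted thread for the "{{PORT}}" trick: here is a rough, standalone sketch of how such a placeholder could be filled in before launching a service. It is not Beam's actual implementation. The class PortPlaceholderSketch and its methods findFreePort and fillPort are hypothetical names made up for illustration, and it uses only the JDK. The module name and the "--port" argument are taken from the thread. Actually launching the process is left out.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of Beam: shows one way a "{{PORT}}" placeholder could be filled.
class PortPlaceholderSketch {

  // Ask the OS for a currently unused TCP port by binding to port 0, then release it.
  static int findFreePort() throws IOException {
    try (ServerSocket socket = new ServerSocket(0)) {
      return socket.getLocalPort();
    }
  }

  // Return a copy of args with every "{{PORT}}" occurrence replaced by the port number.
  static List<String> fillPort(List<String> args, int port) {
    List<String> filled = new ArrayList<>();
    for (String arg : args) {
      filled.add(arg.replace("{{PORT}}", Integer.toString(port)));
    }
    return filled;
  }

  public static void main(String[] argv) throws IOException {
    int port = findFreePort();
    List<String> args = fillPort(List.of("--port", "{{PORT}}"), port);
    // Module name as quoted in the thread; this sketch only prints the command it would run.
    System.out.println(
        "python -m apache_beam.runners.portability.local_job_service_main "
            + String.join(" ", args));
    // The same port is what a portable runner would be pointed at as its job endpoint.
    System.out.println("job endpoint: localhost:" + port);
  }
}
```

One caveat with this approach: the port is released before the service binds it, so another process could grab it in between. That is usually tolerable for local tests, but worth a retry loop if it ever bites.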
