It's a great idea for sure, and I'll create a diagram for Beam and Hop separately. But ... this mail was mistakenly sent to the wrong dev group :-| Apologies for that.
But while I'm on the topic: the Beam development model is simple in the sense that it forces a single classpath. In complicated ETL scenarios this causes all sorts of drawbacks: library version collisions, minimal-packaging ideals, containerization, ... There are currently 194 pre-packaged transforms in Apache Hop, each with their own dependencies and libraries. It's great to give people a lot of choice and options, but we're looking for ways to manage the assembly phase and to validate the deployed software when it reaches production. Right now, a lot of people only find out that a pipeline is broken at the software level when it actually breaks at runtime. Solving that in the general sense, with extra metadata and a more intelligent fat-jar builder, is the ultimate goal I think.

Thanks,
Matt

On Tue, Feb 22, 2022 at 4:27 PM Kerry Donny-Clark <[email protected]> wrote:

> Thanks Matt. These are important issues, and I agree that it's well worth
> figuring out a solution. Especially the libraries being intelligently split
> per runner, and doing it in a way that gives more fine-grained control over
> the build. A design sketch shared to this list is probably a good start.
> Can you write something up and share?
> Kerry
>
> On Tue, Feb 22, 2022 at 9:58 AM Matt Casters <[email protected]>
> wrote:
>
>> Hello Hops,
>>
>> I've been struggling with a few classpath-related issues:
>>
>> * Plugin data types are only accessible from the plugin they were
>> introduced with (Avro, Graph)
>> * There is no safe way for one plugin to add plugins (Beam) to the
>> root class loader
>>
>> This has been causing all sorts of class loader problems, which are
>> typically resolved either by shoving everything into the root classloader
>> (Avro data type) or by having large blobs for a plugin (engines/beam).
>>
>> In the ideal scenario we'd have, for example, all the Kafka plugins in
>> one plugin, with all the dependencies nicely grouped together in one
>> plugins/transforms/kafka folder, and this would include all the
>> Beam-related code as well. The caveats are that we can't ship the Beam
>> libs in every plugin and that it should be easy to remove functionality.
>>
>> What I would love to do is come up with an alternative way of assembling
>> and building our software. For this to happen, I think it should be
>> possible for any external "plugin" project to register classes in the
>> root class loader. There are various options to make this happen, for
>> example an extra folder like "libroot" in the plugin folder. It would
>> act as if the libraries in it belonged to the root classloader, and our
>> scripts would need to be able to pick this up.
>>
>> I would also love to see some extra metadata around the libraries that
>> we assemble in folders. For example, we'll want to create a smarter
>> "fat jar" builder which knows that Spark, Flink and Dataflow are
>> different platforms and that we don't need the libraries of all 3
>> platforms to run something on any one of them. Perhaps, by splitting up
>> the libraries in a more fine-grained manner, we could also add a small
>> JSON file like "library-metadata.json" containing metadata that can then
>> be picked up by the fat-jar builder?
>>
>> In the plugins/engines/beam case you'd have the folders: lib, libroot,
>> libspark, libflink, libdataflow, ...
>> The Kafka transform code could move to plugins/transforms/kafka, and
>> so on.
>>
>> Let's brainstorm about the possibilities and the potential problems to
>> come up with the next architecture for Hop.
>>
>> Cheers,
>> Matt
>>
>>
>> --
>> Neo4j Chief Solutions Architect
>> ✉ [email protected]
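To make the "library-metadata.json" and per-runner folder idea from the thread a bit more concrete, here is a minimal sketch of how a smarter fat-jar builder could use such metadata to pick only the library folders needed for one target platform. Everything here is a hypothetical illustration, not existing Hop code: the JSON field names ("folders", "platforms", "root_classloader") and the select_folders helper are all assumptions.

```python
import json

# Hypothetical library-metadata.json for plugins/engines/beam: each library
# folder declares which runner platform(s) its jars belong to.
METADATA = json.loads("""
{
  "folders": {
    "lib":         {"platforms": ["all"]},
    "libroot":     {"platforms": ["all"], "root_classloader": true},
    "libspark":    {"platforms": ["spark"]},
    "libflink":    {"platforms": ["flink"]},
    "libdataflow": {"platforms": ["dataflow"]}
  }
}
""")

def select_folders(metadata, target_platform):
    """Return the library folders needed to build a fat jar for one
    platform, skipping the libraries of the other runners."""
    selected = []
    for folder, info in metadata["folders"].items():
        platforms = info["platforms"]
        if "all" in platforms or target_platform in platforms:
            selected.append(folder)
    return selected

print(select_folders(METADATA, "spark"))
# -> ['lib', 'libroot', 'libspark']
```

A builder driven by metadata like this would bundle libflink and libdataflow only when those runners are requested, which is the "we don't need the libraries of all 3 platforms" point above; the root_classloader flag sketches how the "libroot" convention could also be declared rather than inferred from the folder name.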
