It's a great idea for sure and I'll create a diagram for Beam and Hop
separately.  But ... this mail was mistakenly sent to the wrong dev group
:-|  Apologies for that.

But while I'm on the topic: the Beam development model in general is simple
in the sense that it forces a single classpath.
In complicated ETL scenarios that causes all sorts of drawbacks: library
version collisions, conflicts with the ideal of minimal packaging,
containerization headaches, ...
There are currently 194 pre-packaged transforms in Apache Hop, each with
their own dependencies and libraries.  It's great to give people a lot of
choice, but we're looking for ways to manage the assembly phase and to
validate the deployed software once it reaches production.

Anyway, I see a lot of people ending up in a scenario where you only find
out about a software-level incompatibility when a pipeline actually
breaks.  Solving that in the general sense, with extra metadata and a more
intelligent fat-jar builder, is the ultimate goal I think.

Thanks,

Matt


On Tue, Feb 22, 2022 at 4:27 PM Kerry Donny-Clark <[email protected]>
wrote:

> Thanks Matt. These are important issues, and I agree that it's well worth
> figuring out a solution. Especially the libraries being intelligently split
> per runner, and doing it in a way that gives finer control over the
> build. A design sketch shared to this list is probably a good start. Can
> you write something up and share?
> Kerry
>
> On Tue, Feb 22, 2022 at 9:58 AM Matt Casters <[email protected]>
> wrote:
>
>> Hello Hops,
>>
>> I've been struggling with a few classpath related issues:
>>
>> * Plugin data types are only accessible from the plugin they were
>> introduced with (Avro, Graph)
>> * There is no safe way for one plugin (Beam) to add classes to the root
>> class loader
>>
>> This has been causing all sorts of class loader problems which are
>> typically resolved by either shoving everything in the root classloader
>> (Avro data type) or by having large blobs for a plugin (engines/beam).
>>
>> In the ideal scenario we'd have for example all the Kafka plugins in one
>> plugin with all the dependencies nicely grouped together in one
>> plugins/transforms/kafka folder and this would include all the Beam related
>> code as well.  The caveats are that we can't ship the Beam libs in every
>> plugin and that it should be easy to remove functionality.
>>
>> What I would love to do is come up with an alternative way of assembling
>> and building our software.  For this to happen I think it should be
>> possible for any external "plugin" project to register classes in the root
>> class loader.  To make this happen there are various options, for example
>> an extra folder such as "libroot" in the plugin folder.  It would have to
>> act as if the libraries in there belonged to the root classloader, and our
>> scripts would need to be able to pick this up.
>>
>> I would also love to see some extra metadata around the libraries that we
>> have assembled in folders.  For example we'll want to create a smarter "fat
>> jar" builder which knows that Spark, Flink and Dataflow are different
>> platforms, and that we don't need the libraries of all 3 platforms to run
>> something on just one of them.  Perhaps by splitting up libraries in a more
>> fine-grained manner we can also add a small JSON file like
>> "library-metadata.json" containing some metadata that can then be picked up
>> by the fat jar builder?
>> In the plugins/engines/beam case you'd have folders: lib, libroot,
>> libspark, libflink, libdataflow, ...
>> The Kafka transforms code could move to plugins/transforms/kafka and so
>> on.
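>> Purely as a sketch, such a "library-metadata.json" (one per library
>> folder, all field names to be decided) could look something like this:
>>
>> ```json
>> {
>>   "folder": "libspark",
>>   "platforms": ["spark"],
>>   "includeInFatJar": true,
>>   "dependencies": ["lib", "libroot"]
>> }
>> ```
>>
>> The fat jar builder could then skip "libflink" and "libdataflow"
>> entirely when the target platform is Spark.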
>>
>> Let's brainstorm around the possibilities and the possible problems to
>> come up with the next architecture for Hop.
>>
>> Cheers,
>> Matt
>>
>>
>>

-- 
Neo4j Chief Solutions Architect
*✉   *[email protected]
