Hi Kenn,

It's fundamentally a question I have asked myself a few times when I see questions on this very mailing list. Automatic column detection, weird data sources... all these things were solved in Kettle a long time ago.

The core Kettle API for a transformation step (as it is called) follows similar logic to an Apache Beam Transform, in the sense that a step reads rows of data and writes them. Things like side-loading are supported, along with a bunch of other options such as directing rows to specific target steps (switch/case) or reading from specific source steps (a merge join specifying left/right). These similarities have made it "fairly easy" to wrap steps up in a Transform/DoFn and ultimately convert Kettle transformations into Beam pipelines.

I think we can make it easier in the future by making some changes to the core API of Kettle itself. The API has been working fine for over 15 years, and it's doable as it is, but I think there are things we learned along the way and there are more options available now.

Before we do something like that, however, we (the core Kettle community) are contemplating making Kettle itself an Apache Incubator project. Kettle is pretty widely used in large organisations across the globe, and the Apache cooperation model is something we think would work better than what is currently in place, for all sorts of reasons I won't go into as I'm trying to phrase this as diplomatically as possible. If anyone has suggestions on this subject, please reach out to me.

But to the core of your question: I do see a lot of value in the reverse, a generic IO wrapper around a bunch of Kettle input and output step plugins. Instead of converting Kettle metadata into the Beam API, you would convert Beam properties to Kettle metadata in some smart way, probably simply by sub-classing some Kettle metadata beans to implement Input or Output interfaces.

What would be an issue is that any data integration running off of metadata (any ETL tool, really) requires input and output formats to be predictable. This means there needs to be a certain contract as to what goes in and out of steps, in any shape or form.
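
To make the "contract" idea concrete, here is a minimal sketch in plain Java. It is not Kettle's or Beam's API: RowContractSketch, RowLayout, FieldMeta and validate are hypothetical names made up for this illustration. It only assumes that a row is an Object[] and that a step declares up front which fields (name and type) it expects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these types are part of the Kettle or Beam APIs.
final class RowContractSketch {

    /** Describes one field of a row: its name and the Java type of its value. */
    static final class FieldMeta {
        final String name;
        final Class<?> type;

        FieldMeta(String name, Class<?> type) {
            this.name = name;
            this.type = type;
        }
    }

    /** The fixed layout a step promises to read or write. */
    static final class RowLayout {
        final List<FieldMeta> fields;

        RowLayout(FieldMeta... fields) {
            this.fields = Arrays.asList(fields);
        }

        /** Returns the contract violations; an empty list means the row conforms. */
        List<String> validate(Object[] row) {
            List<String> problems = new ArrayList<>();
            if (row.length != fields.size()) {
                problems.add("expected " + fields.size() + " values but got " + row.length);
                return problems;
            }
            for (int i = 0; i < row.length; i++) {
                FieldMeta field = fields.get(i);
                // null is accepted for any field; otherwise the value must match the declared type
                if (row[i] != null && !field.type.isInstance(row[i])) {
                    problems.add("field '" + field.name + "' expected "
                            + field.type.getSimpleName() + " but got "
                            + row[i].getClass().getSimpleName());
                }
            }
            return problems;
        }
    }

    public static void main(String[] args) {
        RowLayout layout = new RowLayout(
                new FieldMeta("id", Long.class),
                new FieldMeta("name", String.class));

        // A conforming row, then one whose first value has the wrong type.
        System.out.println(layout.validate(new Object[] {1L, "Kenn"}));
        System.out.println(layout.validate(new Object[] {"oops", "Kenn"}));
    }
}
```

The point of the sketch is only that the layout is known before any data flows, so a downstream step can rely on it. A real adapter would presumably carry richer metadata (lengths, precision, formats) than a name and a Java class.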

Because of this, the current pipelines we build pass data around in the form of a KettleRow (PCollection<KettleRow>). KettleRow is just a wrapper around an Object[], and you get a description of what's in there. If folks can live with that, they can easily convert this data to other formats.

All the best,

Matt

On Mon, 25 Feb 2019 at 00:25, Kenneth Knowles <[email protected]> wrote:

> Nice work! I'm impressed at how quickly this has come together.
>
> Did you build a generic adapter for using Kettle connectors in Beam? (I
> don't know what a Kettle connector API looks like)
>
> It would be cool to make these connectors more broadly available to Beam
> users, though maybe not optimal for parallel big data reads.
>
> Kenn
>
> On Sun, Feb 24, 2019 at 1:13 PM Matt Casters <[email protected]>
> wrote:
>
>> Folks, it's not my habit but playing around with running Kettle
>> transformations on Flink w/ Beam was so cool I had to blog about it.
>>
>> http://sandbox.kettle.be/wordpress/index.php/2019/02/24/kettle-beam-update-0-5-0/
>>
>> Allow me to again extend my thanks to all the developers involved. Some
>> really cool things are happening right now.
>> Version 0.5.0 of Kettle Beam now supports all Kettle steps including
>> third party connectors like SalesForce, SAP, Neo4j and so on. Obviously
>> they don't always make sense in a big data context but side-loading the
>> data for in-memory lookup and so on can indeed make a lot of sense in a lot
>> of scenarios.
>> For the batched output I also managed to get performance on-par with
>> expectations, specifically for Neo4j since I work for the company after
>> all. I really appreciate all the help I got so far getting to this point.
>> In a record time we've gone from conceptual work to something we can
>> consider to be stable. Apache Beam has really made a huge difference.
>>
>> Cheers,
>>
>> Matt
>> ---
>> Matt Casters <[email protected]>
>> Senior Solution Architect, Kettle Project Founder
