You might want to look at this as a starting point instead: https://github.com/delta-io/connectors
I've been involved in some unofficial talks about building such a processor
for a little while, but the opportunity hasn't quite lined up yet because
Delta is still maturing on creating a public API for non-Spark users. In the
short term, nothing is stopping you from integrating NiFi and Spark to get
something similar going: just build Parquet files and stream them to a place
where Spark + Delta can read them and integrate them into your Delta Lake
(a rough sketch of that job is at the end of this message, after the quoted
thread).

On Sat, Feb 1, 2020 at 4:14 PM Joey Frazee <[email protected]> wrote:

> Martin, I've been thinking about this one for a while, but I think it needs
> to be considered in the context of transactional table formats in general,
> i.e., the incubating Apache Hudi and Apache Iceberg too.
>
> There are things that are inconvenient for NiFi to do with these table
> formats.
>
> But the specific use case you described can be done today using NiFi's
> out-of-the-box SQL processors and JDBC/ODBC with all of the mentioned
> databases/engines.
>
> -joey
>
> On Feb 1, 2020, 6:57 AM -0600, Martin Ebert <[email protected]>, wrote:
> > Hi community,
> > how can we drive this topic forward? A Jira has been created:
> > https://issues.apache.org/jira/projects/NIFI/issues/NIFI-6976
> >
> > "A table in Delta Lake is both a batch table, as well as a streaming
> > source and sink. Streaming data ingest, batch historic backfill, and
> > interactive queries all just work out of the box." (Delta.io)
> >
> > This is the decisive argument for me. A very impressive technological
> > milestone that is just crying out to be implemented in NiFi. You can find
> > all the details in the video here: https://youtu.be/VLd_qOrKrTI
> >
> > Delta Lake is related to Databricks, Athena, and Presto. In our case it
> > would be great to extract data from a database or any other source (which
> > can be streaming) and send this data or stream to our Databricks cluster.
> >
> > I imagine it just like in the video: you have a Delta Lake processor
> > where you can define which Databricks cluster the data should go to and
> > which Delta Lake operation (upsert, merge, delete, ...) should happen
> > with the data. That means Databricks is only the executing component,
> > and I no longer have to write code in Databricks notebooks. I also like
> > the possibility of requesting an extra cluster from the processor.
> >
> > Being able to do the same with Athena and Presto would be a dream!
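
To make the interim NiFi + Spark suggestion above a bit more concrete, here is a
minimal sketch of the Spark side, assuming NiFi lands Parquet files under a
/landing/events directory and the Delta table lives at /delta/events. The paths,
schema, and checkpoint location are illustrative assumptions, not anything from
this thread, and delta-core has to be on the Spark classpath.

// Minimal sketch (Scala + Spark Structured Streaming): pick up Parquet files
// written by NiFi and continuously append them to a Delta table.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructType, TimestampType}

object NifiParquetToDelta {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("nifi-parquet-to-delta")
      .getOrCreate()

    // Streaming file sources require an explicit schema (example columns).
    val schema = new StructType()
      .add("id", LongType)
      .add("payload", StringType)
      .add("event_time", TimestampType)

    // Watch the landing directory NiFi writes Parquet files into...
    val incoming = spark.readStream
      .schema(schema)
      .parquet("/landing/events")

    // ...and continuously append the new files to the Delta table.
    incoming.writeStream
      .format("delta")
      .outputMode("append")
      .option("checkpointLocation", "/delta/_checkpoints/events")
      .start("/delta/events")
      .awaitTermination()
  }
}

Something like spark-submit --packages io.delta:delta-core_2.11:0.5.0 (or
whatever Delta build matches your Spark version) should be enough to run it.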

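For the operation-aware processor Martin describes, the existing Delta Lake
Scala API already exposes the operations it would delegate to. Here is a hedged
sketch of an upsert (MERGE) via io.delta.tables.DeltaTable; the table path, join
condition, and column names are assumptions for illustration only.

// Sketch: upsert a freshly landed batch into an existing Delta table.
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

object DeltaUpsertSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("delta-upsert-sketch").getOrCreate()

    // Incoming batch of changes, e.g. the Parquet files NiFi just delivered.
    val updates = spark.read.parquet("/landing/events")

    // MERGE them into the target Delta table: update matches, insert the rest.
    DeltaTable.forPath(spark, "/delta/events")
      .as("target")
      .merge(updates.as("source"), "target.id = source.id")
      .whenMatched().updateAll()
      .whenNotMatched().insertAll()
      .execute()
  }
}

A hypothetical NiFi processor would essentially just parameterize the table
location, the operation (append, merge, delete), and the match condition.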