You might want to look at this as a starting point instead: https://github.com/delta-io/connectors
I've been involved in some unofficial talks about building such a processor
for a little while, but the opportunity hasn't quite lined up yet because
Delta is still maturing on creating a public API for non-Spark users. In the
short term, nothing is stopping you from integrating NiFi and Spark to get
something similar going: just build Parquet files and stream them to a place
where Spark + Delta can read them and integrate them into your Delta Lake
(a rough sketch of that job is at the end of this message, after the quoted
thread).

On Sat, Feb 1, 2020 at 4:14 PM Joey Frazee <[email protected]> wrote:

> Martin, I've been thinking about this one for a while, but I think it needs
> to be considered in the context of transactional table formats in general,
> i.e., the incubating Apache Hudi and Apache Iceberg too.
>
> There are things that are inconvenient for NiFi to do with these table
> formats.
>
> But the specific use case you described can be done today using NiFi's
> out-of-the-box SQL processors and JDBC/ODBC with all of the mentioned
> databases/engines.
>
> -joey
>
> On Feb 1, 2020, 6:57 AM -0600, Martin Ebert <[email protected]>, wrote:
> > Hi community,
> > how can we drive this topic forward? A Jira has been created:
> > https://issues.apache.org/jira/projects/NIFI/issues/NIFI-6976
> >
> > "A table in Delta Lake is both a batch table, as well as a streaming
> > source and sink. Streaming data ingest, batch historic backfill, and
> > interactive queries all just work out of the box." (Delta.io)
> >
> > This is the decisive argument for me. A very impressive technological
> > milestone that is just crying out to be implemented in NiFi. You can find
> > all the details in the video here: https://youtu.be/VLd_qOrKrTI
> >
> > Delta Lake is related to Databricks, Athena, and Presto. In our case it
> > would be great to extract data from a database or any other source (which
> > can be streaming) and send this data or stream to our Databricks cluster.
> >
> > I imagine it just like in the video: you have a Delta Lake processor
> > where you can define which Databricks cluster the data should go to and
> > which Delta Lake operation (upsert, merge, delete, ...) should happen
> > with the data. That means Databricks is only the executing component,
> > and I no longer have to write code in Databricks notebooks. I also like
> > the possibility of requesting an extra cluster from the processor.
> >
> > Being able to do the same with Athena and Presto would be a dream!
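
To make the interim NiFi + Spark suggestion above a bit more concrete, here is a
minimal sketch of the Spark side, assuming NiFi lands Parquet files under a
/landing/events directory and the Delta table lives at /delta/events. The paths,
schema, and checkpoint location are illustrative assumptions, not anything from
this thread, and delta-core has to be on the Spark classpath.

// Minimal sketch (Scala + Spark Structured Streaming): pick up Parquet files
// written by NiFi and continuously append them to a Delta table.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructType, TimestampType}

object NifiParquetToDelta {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("nifi-parquet-to-delta")
      .getOrCreate()

    // Streaming file sources require an explicit schema (example columns).
    val schema = new StructType()
      .add("id", LongType)
      .add("payload", StringType)
      .add("event_time", TimestampType)

    // Watch the landing directory NiFi writes Parquet files into...
    val incoming = spark.readStream
      .schema(schema)
      .parquet("/landing/events")

    // ...and continuously append the new files to the Delta table.
    incoming.writeStream
      .format("delta")
      .outputMode("append")
      .option("checkpointLocation", "/delta/_checkpoints/events")
      .start("/delta/events")
      .awaitTermination()
  }
}

Something like spark-submit --packages io.delta:delta-core_2.11:0.5.0 (or
whatever Delta build matches your Spark version) should be enough to run it.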

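For the operation-aware processor Martin describes, the existing Delta Lake
Scala API already exposes the operations it would delegate to. Here is a hedged
sketch of an upsert (MERGE) via io.delta.tables.DeltaTable; the table path, join
condition, and column names are assumptions for illustration only.

// Sketch: upsert a freshly landed batch into an existing Delta table.
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

object DeltaUpsertSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("delta-upsert-sketch").getOrCreate()

    // Incoming batch of changes, e.g. the Parquet files NiFi just delivered.
    val updates = spark.read.parquet("/landing/events")

    // MERGE them into the target Delta table: update matches, insert the rest.
    DeltaTable.forPath(spark, "/delta/events")
      .as("target")
      .merge(updates.as("source"), "target.id = source.id")
      .whenMatched().updateAll()
      .whenNotMatched().insertAll()
      .execute()
  }
}

A hypothetical NiFi processor would essentially just parameterize the table
location, the operation (append, merge, delete), and the match condition.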