Delta tables can be registered in the metastore and you can enable Delta SQL 
commands with Spark.
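
For reference, here’s a minimal sketch of that setup with open-source Delta (assuming Delta Lake 0.7+ on Spark 3.x; the table name and the s3a path are just placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Enable Delta's SQL extensions and register an existing Delta directory as a
// metastore table so it can be queried by name. Assumes Delta Lake 0.7+ on
// Spark 3.x; the path and table name below are placeholders.
val spark = SparkSession.builder()
  .appName("delta-metastore-sketch")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
          "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .enableHiveSupport()
  .getOrCreate()

// Register the existing Delta table in the metastore (schema comes from the Delta log).
spark.sql(
  """CREATE TABLE IF NOT EXISTS events
    |USING DELTA
    |LOCATION 's3a://my-bucket/delta/events'""".stripMargin)

// Delta-specific SQL commands now work against the registered table.
spark.sql("DESCRIBE HISTORY events").show()
spark.sql("VACUUM events RETAIN 168 HOURS")
```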

And there are integrations with other engines for the read path.
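
For the Presto/Athena read path specifically, Delta 0.5 added manifest generation; a rough sketch (paths are placeholders again):

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

// Generate a symlink manifest for the Delta table; Presto/Athena then query
// an external table pointing at the manifest directory. Paths are placeholders.
val spark = SparkSession.builder().appName("delta-manifest-sketch").getOrCreate()

DeltaTable
  .forPath(spark, "s3a://my-bucket/delta/events")
  .generate("symlink_format_manifest")

// The external table on the Presto/Athena side is defined roughly as:
//   CREATE EXTERNAL TABLE events_presto (...)
//   ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
//   STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
//   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
//   LOCATION 's3a://my-bucket/delta/events/_symlink_format_manifest/'
```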

This isn’t to say a native integration isn’t needed; it is. But there are some 
options today.

-joey
On Feb 1, 2020, 7:13 PM -0600, Mike Thomsen <[email protected]>, wrote:
> > But the specific use case you described can be done today using NiFi’s
> > out-of-the-box SQL processors and JDBC/ODBC with all of the mentioned
> > databases/engines.
>
> I'm not aware of any direct JDBC route that would allow NiFi to use the SQL
> processors to hit Delta Lake. I'd like to be proven wrong here (heck, it
> would make some of our use cases easier!), but AFAIK you have to have NiFi
> build Parquet and push it to S3 or HDFS so Spark can do the transformation
> (roughly the hand-off sketched after this thread).
>
> On Sat, Feb 1, 2020 at 4:14 PM Joey Frazee <[email protected]>
> wrote:
>
> > Martin, I’ve been thinking about this one for a while but I think it needs
> > to be considered in the context of transactional table formats in general;
> > i.e., the incubating Apache Hudi and Apache Iceberg too.
> >
> > There are things that are inconvenient for NiFi to do with these table
> > formats.
> >
> > But the specific use case you described can be done today using NiFi’s
> > out-of-the-box SQL processors and JDBC/ODBC with all of the mentioned
> > databases/engines.
> >
> > -joey
> > On Feb 1, 2020, 6:57 AM -0600, Martin Ebert <[email protected]>, wrote:
> > > Hi community,
> > > how can we drive this topic forward? Jira is created:
> > > https://issues.apache.org/jira/projects/NIFI/issues/NIFI-6976
> > >
> > > "A table in Delta Lake is both a batch table, as well as a streaming
> > source
> > > and sink. Streaming data ingest, batch historic backfill, and interactive
> > > queries all just work out of the box." (Delta.io)
> > >
> > > This is the decisive argument for me. A very impressive technological
> > > milestone that is just crying out to be implemented in NiFi. You can find
> > > all the details in the video here: https://youtu.be/VLd_qOrKrTI
> > >
> > > Delta Lake works with Databricks, Athena, and Presto. In our case it
> > > would be great to extract data from a database or any other source (which
> > > can be streaming) and send this data or stream to our Databricks cluster.
> > >
> > > I imagine it just like in the video. You have a Delta Lake processor
> > > where you can define which Databricks cluster the data should go to and
> > > which Delta Lake operation (upsert, merge, delete, ...) should be applied
> > > to the data. That means Databricks is only the executing component and I
> > > don't have to write code in Databricks notebooks anymore. I also find it
> > > cool that the processor could request an extra cluster.
> > >
> > > Being able to do the same with Athena and Presto would be a dream!
> >
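
For completeness, a minimal sketch of the Parquet-staging hand-off Mike describes above, i.e., NiFi lands plain Parquet and a small Spark job appends it into the Delta table; the paths and job setup are placeholders, not anything NiFi ships today:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical hand-off: NiFi writes plain Parquet into a staging prefix and
// a scheduled Spark job appends it into the Delta table. Paths are placeholders.
val spark = SparkSession.builder().appName("nifi-staging-to-delta").getOrCreate()

spark.read
  .parquet("s3a://my-bucket/staging/events/") // Parquet files landed by NiFi
  .write
  .format("delta")
  .mode("append")
  .save("s3a://my-bucket/delta/events/")      // target Delta table
```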
