For the NiFi side, I've set things up locally using the Parquet processors to write the files to local disk and then had standalone Spark read those into a Delta Lake table. It was very easy to do using basic info from public tutorials on Parquet, Delta, Spark SQL, etc. I don't have any code to share right now, but this is a rough sketch for NiFi:
1. Write an Avro schema for a simple data set.
2. Generate some sample data.
3. Load the data and put it through the Parquet processors with a Hadoop settings file that lets them write to local disk or S3.
4. Have a Spark process in place that will read the input Parquet folder and merge it into the Delta table according to your preferences (a rough sketch of that Spark side is below).

Beyond that, I think you'll need to go over to the Delta mailing list/Google Group and ask them about best practices like going from bronze -> silver -> gold. (There's also a quick sketch of the metastore/Delta SQL question at the very bottom of this message, below the quoted thread.)
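For step 4, something like this minimal PySpark sketch is what I have in mind. It's only a sketch: the paths, the "id" merge key, and the table layout are made up for illustration, and it assumes the delta-core package is on the Spark classpath.

# Rough sketch only: read the Parquet files the NiFi flow wrote and upsert
# them into a Delta table. Paths and the "id" key are hypothetical.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder
    .appName("nifi-parquet-to-delta")
    # Delta Lake configs for a plain (non-Databricks) Spark install.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Folder the NiFi Parquet processors write to (local disk here; an s3a:// path
# works too if the Hadoop settings file allows it).
updates = spark.read.parquet("/data/nifi/parquet-out")

delta_path = "/data/delta/bronze/my_table"

if DeltaTable.isDeltaTable(spark, delta_path):
    # Merge the new batch into the existing table on a hypothetical "id" column.
    (DeltaTable.forPath(spark, delta_path).alias("t")
        .merge(updates.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
else:
    # First run: just create the Delta table from the initial batch.
    updates.write.format("delta").save(delta_path)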
On Sun, Feb 2, 2020 at 5:03 AM Martin Ebert <[email protected]> wrote:

> Can anyone share a concrete minimal example with Nifi?
>
> Scenario:
> I have two tables in parquet format in S3. This is my source.
>
> Step #1 Bronze Layer / Raw
> Now I would like to consume these tables in step 1 (ListS3 + FetchS3) and
> save each as a delta table in S3 (new delta processor). I use, for example,
> overwrite for each execution. I expect that something like partitions and
> the transaction logs are created correctly.
>
> Step #2 Silver Layer / Prep
> Now I read the two delta tables (new delta processor) from the previous
> step as batch or stream, filter my data (configurable in the delta
> processor) and save the new delta tables in my Silver Layer.
>
> Step #3 Gold Layer / Analyze
> Now I read the two delta tables from the previous step as batch or stream
> and "connect" my data (join, merge, delete, ...). The results will end up
> as delta in S3 or as a single parquet file (depends on the next steps).
>
> This is how I see the current process.
>
> There are two things that are not quite clear to me:
> 1) What exactly this looks like: "Delta tables can be registered in the
> metastore and you can enable Delta SQL commands with Spark".
> 2) How do I run the SQL scripts on Databricks? Simply by using the JDBC
> connection to Databricks?
>
> On Sun, Feb 2, 2020, 02:47, Joey Frazee <[email protected]> wrote:
>
> > Delta tables can be registered in the metastore and you can enable Delta
> > SQL commands with Spark.
> >
> > And there are integrations with other engines for the read path.
> >
> > This isn't to say a native integration isn't needed; it is. But there
> > are some options today.
> >
> > -joey
> >
> > On Feb 1, 2020, 7:13 PM -0600, Mike Thomsen <[email protected]> wrote:
> >
> > > > But the specific use case you described can be done today using
> > > > Nifi's out of the box SQL processors and JDBC/ODBC with all of the
> > > > mentioned databases/engines.
> > >
> > > I'm not aware of any direct JDBC route that would allow NiFi to use
> > > the SQL processors to hit Delta Lake. Would like to be proved wrong
> > > here (heck, it would make some of our use cases easier!) but AFAIK you
> > > have to have NiFi build Parquet and push to S3 or HDFS so Spark can do
> > > the transformation.
> > >
> > > On Sat, Feb 1, 2020 at 4:14 PM Joey Frazee <[email protected]>
> > > wrote:
> > >
> > > > Martin, I've been thinking about this one for a while but I think it
> > > > needs to be considered in the context of transactional table formats
> > > > in general; i.e., the incubating Apache Hudi and Apache Iceberg too.
> > > >
> > > > There are things that are inconvenient for NiFi to do with these
> > > > table formats.
> > > >
> > > > But the specific use case you described can be done today using
> > > > Nifi's out of the box SQL processors and JDBC/ODBC with all of the
> > > > mentioned databases/engines.
> > > >
> > > > -joey
> > > >
> > > > On Feb 1, 2020, 6:57 AM -0600, Martin Ebert <[email protected]> wrote:
> > > >
> > > > > Hi community,
> > > > > how can we drive this topic forward? Jira is created:
> > > > > https://issues.apache.org/jira/projects/NIFI/issues/NIFI-6976
> > > > >
> > > > > "A table in Delta Lake is both a batch table, as well as a
> > > > > streaming source and sink. Streaming data ingest, batch historic
> > > > > backfill, and interactive queries all just work out of the box."
> > > > > (Delta.io)
> > > > >
> > > > > This is the decisive argument for me. A very impressive
> > > > > technological milestone that is just crying out to be implemented
> > > > > in Nifi. You'll find all the details in the video here:
> > > > > https://youtu.be/VLd_qOrKrTI
> > > > >
> > > > > Delta Lake is related to Databricks, Athena and Presto. In our
> > > > > case it would be great to extract data from a database or any
> > > > > other source (can be streaming) and send this data or stream to
> > > > > our Databricks cluster.
> > > > >
> > > > > I imagine it just like in the video. You have a Delta Lake
> > > > > processor where you can define which Databricks cluster the data
> > > > > should go to and which Delta Lake operation (upsert, merge,
> > > > > delete, ...) should happen with the data. That means Databricks is
> > > > > only the executing component and I don't have to code in
> > > > > Databricks notebooks anymore. I also find the option to request an
> > > > > extra cluster from the processor cool.
> > > > >
> > > > > Being able to do the same with Athena and Presto would be a dream!
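Regarding question 1) in Martin's message above: a minimal sketch of registering an existing Delta table in the metastore and using the Delta SQL commands from plain Spark could look like the following. The table name and S3 location are made up, and it assumes a reasonably recent Spark/Delta combination with Hive metastore support and the delta-core package available.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-sql-sketch")
    # Enable Delta's SQL commands and catalog integration on plain Spark.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .enableHiveSupport()
    .getOrCreate()
)

# Register an existing Delta table (its data files plus _delta_log) in the metastore.
spark.sql("""
    CREATE TABLE IF NOT EXISTS bronze_events
    USING DELTA
    LOCATION 's3a://my-bucket/delta/bronze/events'
""")

# After that, ordinary SQL and Delta-specific commands work against the table name.
spark.sql("SELECT COUNT(*) FROM bronze_events").show()
spark.sql("DESCRIBE HISTORY bronze_events").show()

Once a table is registered like this, the silver/gold steps can just be Spark SQL or DataFrame jobs against the table names.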
