Hi Matt, I quickly developed this, and this is how I was able to do it:

DataLake <- ExecuteSQL -> ConvertAvroToJSON -> SplitJson -> EvaluateJsonPath -> ReplaceText -> PutSQL -> Postgres (on cloud)
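To give a concrete picture of the last few steps: EvaluateJsonPath
pulls each record's column values into flow file attributes, and
ReplaceText then rewrites the content as one INSERT statement per row,
roughly like this (the table and column names here are placeholders,
not my real schema):

    INSERT INTO customer (id, name, updated_at)
    VALUES ('${id}', '${name}', '${updated_at}')

PutSQL then executes a separate statement like that for every source
row.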
The problem is that this will not scale for huge volumes. Any thoughts?

Regards,
Anil

On Tue, May 2, 2017 at 12:07 PM, Matt Burgess <[email protected]> wrote:
> Yes, that sounds like your best bet, assuming you have the "Maximum
> Value Column" present in the table you want to migrate. Then a flow
> might look like:
>
> QueryDatabaseTable -> ConvertAvroToJSON -> ConvertJSONToSQL -> PutSQL
>
> In this flow the target tables would need to be created beforehand.
> You might be able to do that with pg_dump or with some fancy SQL that
> you could send to PutSQL in a separate (do-ahead) flow [1]. For
> multiple tables, you will need one QueryDatabaseTable for each table;
> depending on the number of tables and the latency for getting/putting
> rows, you may be able to share the downstream processing. If that
> creates a bottleneck, you may want a copy of the above flow for each
> table. This is drastically improved in NiFi 1.2.0, as you can use
> ListDatabaseTables -> GenerateTableFetch -> RPG -> Input Port ->
> ExecuteSQL to perform the migration in parallel across a NiFi cluster.
>
> Regards,
> Matt
>
> [1] https://serverfault.com/questions/231952/is-there-a-mysql-equivalent-of-show-create-table-in-postgres
>
> On Tue, May 2, 2017 at 11:18 AM, Anil Rai <[email protected]> wrote:
> > Thanks Matt for the quick reply. We are using the NiFi 1.0 release
> > as of now. It's a Postgres DB on both sides (on prem and on cloud),
> > and yes, incremental load is what I am looking for. So with that,
> > you recommend option #2?
> >
> > On Tue, May 2, 2017 at 11:00 AM, Matt Burgess <[email protected]> wrote:
> >
> >> Anil,
> >>
> >> Is this a "one-time" migration, meaning you would take the on-prem
> >> tables and put them on the cloud DB just once? Or would this be an
> >> incremental operation, where you do the initial move and then take
> >> any "new" rows from the source and apply them to the target? For
> >> the latter, there are a couple of options:
> >>
> >> 1) Rebuild the cloud DB periodically. You can use ExecuteSQL ->
> >> [processors] -> PutSQL after perhaps deleting your target
> >> DB/tables/etc. This could be time-consuming and expensive. The
> >> processors in between would probably include ConvertAvroToJSON and
> >> ConvertJSONToSQL.
> >> 2) Use QueryDatabaseTable or (GenerateTableFetch -> ExecuteSQL) to
> >> get the source data. For this, your table would need a column whose
> >> values always increase; that column's name would be the value of
> >> the "Maximum Value Column" property in the aforementioned
> >> processors' configuration dialogs. You would need one
> >> QueryDatabaseTable or GenerateTableFetch for each table in your DB
> >> (there's a sketch of the generated query below).
> >>
> >> In addition to these current solutions, as of the upcoming NiFi
> >> 1.2.0 release, you have the following options:
> >> 3) If the source database is MySQL, you can use the
> >> CaptureChangeMySQL processor to get binary log events flowing
> >> through various processors into PutDatabaseRecord to place them at
> >> the target. This pattern is true Change Data Capture (CDC) versus
> >> the other two options above.
> >> 4) Option #2 will be improved such that GenerateTableFetch will
> >> accept incoming flow files, so you can use ListDatabaseTables ->
> >> GenerateTableFetch -> ExecuteSQL to handle multiple tables with one
> >> flow.
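> >>
> >> To sketch what #2 generates under the covers: with a "Maximum
> >> Value Column" such as an updated_at timestamp, each run of
> >> QueryDatabaseTable (or GenerateTableFetch) effectively issues
> >> something like the following, where the table and column names are
> >> just placeholders:
> >>
> >>   SELECT *
> >>   FROM customer
> >>   WHERE updated_at > '2017-05-01 00:00:00'
> >>
> >> The literal in the WHERE clause is the largest value observed on
> >> the previous run; the processor stores that state so each run only
> >> fetches new rows.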
> >>
> >> If this is a one-time migration, a data flow tool might not be the
> >> best choice; you could consider something like Flyway [1] instead.
> >>
> >> Regards,
> >> Matt
> >>
> >> [1] https://flywaydb.org/documentation/command/migrate
> >>
> >> On Tue, May 2, 2017 at 10:41 AM, Anil Rai <[email protected]> wrote:
> >> > I have a simple use case.
> >> >
> >> > DB (On Premise) and DB (On Cloud).
> >> >
> >> > I want to use NiFi to extract data from the on-prem DB (huge
> >> > volumes) and insert it into the same table structure that is
> >> > hosted on the cloud.
> >> >
> >> > I could use ExecuteSQL on both sides of the fence (to extract
> >> > from on prem and insert onto the cloud). What processors are
> >> > needed in between (if at all)? As I am not doing any
> >> > transformations at all, it is just an extract-and-load use case.
