I initially thought you were saying that you had 250 Avro schemas that you
had to use, as in 250 distinct data models.

Maybe someone else has a suggestion on how to do it, but I think this may
just be a fundamental problem of having that many different databases in
MySQL and trying to do CDC on them.

Is there a hard business requirement to segregate data like that, or is
some other factor, like pulling from many remote databases, at play here?
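On the tooling idea mentioned further down the thread (provisioning a new
CDC processor for each newly added schema), one option is to drive it
through the NiFi REST API instead of hand-managing instances. A rough
sketch in Python, not run against a live cluster -- the base URL, the
process-group id, and the property key inside `properties` are
placeholders you would need to check against your NiFi version; the
processor class name comes from the nifi-cdc-mysql bundle:

```python
# Sketch: create one CaptureChangeMySQL processor per schema via the
# NiFi REST API (POST /nifi-api/process-groups/{id}/processors).
import json
from urllib import request

NIFI_API = "http://localhost:8080/nifi-api"  # placeholder NiFi instance
CDC_TYPE = "org.apache.nifi.cdc.mysql.processors.CaptureChangeMySQL"

def cdc_processor_payload(schema: str) -> dict:
    """Build the ProcessorEntity JSON for one schema's CDC processor."""
    return {
        "revision": {"version": 0},
        "component": {
            "type": CDC_TYPE,
            "name": f"MysqlCDC_{schema}",
            "config": {
                # Placeholder property name -- look up the real property
                # keys in the CaptureChangeMySQL docs before using this.
                "properties": {"Database/Schema Name Pattern": schema},
                # Keep only one active copy per processor in the cluster.
                "executionNode": "PRIMARY",
            },
        },
    }

def provision(schema: str, process_group_id: str) -> None:
    """POST the processor into the given process group (untested sketch)."""
    url = f"{NIFI_API}/process-groups/{process_group_id}/processors"
    body = json.dumps(cdc_processor_payload(schema)).encode()
    req = request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    request.urlopen(req)  # raises on HTTP errors

if __name__ == "__main__":
    print(cdc_processor_payload("schema1")["component"]["name"])
```

The same script could periodically diff the schemas present in MySQL
against the processors already in the process group and create only the
missing ones, which covers the "monitor/provision" part without a human
in the loop.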

On Thu, Oct 18, 2018 at 6:19 AM ashwin konale <[email protected]>
wrote:

> Hi,
>
> The flow is like this,
>
> MysqlCDC -> UpdateAttributes -> MergeContent -> (PutHDFS, PutGCS)
>
> But we have around 250 schemas to pull data from. So with a clustered setup,
>
> MysqlCDC_schema1 -> RPG
> MysqlCDC_schema2 -> RPG
> MysqlCDC_schema3 -> RPG and so on
>
> InputPort -> UpdateAttributes -> MergeContent -> (PutHDFS, PutGCS)
>
> But MysqlCDC can run only on the primary node in the cluster, so I will end
> up running all of the input processors on a single node. This can easily
> become a bottleneck as the number of schemas grows. Could you suggest an
> alternative approach to this problem?
>
> On 2018/10/17 21:14:09, Mike Thomsen <[email protected]> wrote:
> > > may have to build some kind of tooling on top of it to
> > > monitor/provision new processor for newly added schemas etc.
> >
> > Could you elaborate on this part of your use case?
> >
> > On Wed, Oct 17, 2018 at 2:31 PM ashwin konale <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > I am experimenting with nifi for one of our use cases, with plans of
> > > extending it to various other data routing and ingestion use cases.
> > > Right now I need to ingest data from mysql binlogs to hdfs/GCS. We
> > > have around 250 different schemas and about 3000 tables to read data
> > > from. The volume of the data flow ranges from 500 - 2000 messages per
> > > second across the different schemas.
> > >
> > > Right now the problem is that the mysqlCDC processor can run in only
> > > one thread. To overcome this issue I have two options.
> > >
> > > 1. Use primary node execution, so a different processor for each of
> > > the schemas. Eventually all processors which read from mysql will run
> > > on a single node, which will be a bottleneck no matter how big my
> > > nifi cluster is.
> > >
> > > 2. Another approach is to use multiple nifi instances to pull data
> > > and have a master nifi cluster for ingestion to the various sinks. In
> > > this approach I will have to manage all these small nifi instances,
> > > and may have to build some kind of tooling on top of them to
> > > monitor/provision new processors for newly added schemas etc.
> > >
> > > Is there any better way to achieve my use case with nifi? Please
> > > advise me on the architecture.
> > >
> > > Looking forward to suggestions.
> > >
> > > - Ashwin
> > >
> >
>