I initially thought you were saying that you had 250 Avro schemas you had to use, as in 250 distinct data models.
Maybe someone else has a suggestion on how to do it, but I think this may
just be a fundamental problem of having that many different databases in
MySQL and trying to do CDC on them. Is there a hard business requirement to
segregate data like that, or is some factor like pulling from many remote
databases at play here?

On Thu, Oct 18, 2018 at 6:19 AM ashwin konale <[email protected]> wrote:

> Hi,
>
> The flow is like this:
>
> MysqlCDC -> UpdateAttributes -> MergeContent -> (PutHDFS, PutGCS)
>
> But we have around 250 schemas to pull data from, so with a clustered
> setup it becomes:
>
> MysqlCDC_schema1 -> RPG
> MysqlCDC_schema2 -> RPG
> MysqlCDC_schema3 -> RPG and so on
>
> InputPort -> UpdateAttributes -> MergeContent -> (PutHDFS, PutGCS)
>
> But MysqlCDC can run only on the primary node in the cluster, so I will
> end up running all of the input processors on a single node. This can
> easily become a bottleneck as the number of schemas grows. Could you
> suggest an alternative approach to this problem?
>
> On 2018/10/17 21:14:09, Mike Thomsen <[email protected]> wrote:
> > > may have to build some kind of tooling on top of it to
> > > monitor/provision new processors for newly added schemas etc.
> >
> > Could you elaborate on this part of your use case?
> >
> > On Wed, Oct 17, 2018 at 2:31 PM ashwin konale <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > I am experimenting with NiFi for one of our use cases, with plans of
> > > extending it to various other data routing and ingestion use cases.
> > > Right now I need to ingest data from MySQL binlogs to HDFS/GCS. We
> > > have around 250 different schemas and about 3000 tables to read data
> > > from. The volume of the data flow ranges from 500 to 2000 messages
> > > per second across the different schemas.
> > >
> > > Right now the problem is that the MysqlCDC processor can run in only
> > > one thread. To overcome this issue I have two options:
> > >
> > > 1. Use primary node execution, with a separate processor for each of
> > > the schemas. Eventually all processors that read from MySQL will run
> > > on a single node, which will be a bottleneck no matter how big my
> > > NiFi cluster is.
> > >
> > > 2. Use multiple NiFi instances to pull data and have a master NiFi
> > > cluster for ingestion to the various sinks. In this approach I will
> > > have to manage all these small NiFi instances, and may have to build
> > > some kind of tooling on top of it to monitor/provision new processors
> > > for newly added schemas etc.
> > >
> > > Is there a better way to achieve my use case with NiFi? Please advise
> > > me on the architecture.
> > >
> > > Looking forward to suggestions.
> > >
> > > - Ashwin
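
On the "tooling to monitor/provision new processors" point: one option, whichever
topology you land on, is to drive the provisioning through the NiFi REST API
instead of clicking processors together by hand. Below is a minimal sketch that
creates one CaptureChangeMySQL processor per schema, pinned to the primary node.
The NiFi host, process group ID, and the property keys in the payload are
assumptions for illustration only; check the exact property names against the
CaptureChangeMySQL documentation for your NiFi version before using this.

    # Hypothetical provisioning helper for the per-schema CDC pattern
    # discussed above. Assumes an unsecured dev NiFi 1.x instance; the
    # endpoint shape (POST /process-groups/{id}/processors) is standard
    # NiFi REST API, but property keys below are illustrative.
    import requests

    NIFI_API = "http://nifi-host:8080/nifi-api"   # assumption: your NiFi URL
    PG_ID = "your-process-group-id"               # assumption: target group

    def provision_cdc_processor(schema_name: str, y_offset: int) -> str:
        """Create a primary-node-only CaptureChangeMySQL processor for one schema."""
        payload = {
            "revision": {"version": 0},
            "component": {
                "type": "org.apache.nifi.cdc.mysql.processors.CaptureChangeMySQL",
                "name": f"MysqlCDC_{schema_name}",
                "position": {"x": 0.0, "y": float(y_offset)},
                "config": {
                    # mirrors the primary-node constraint from the thread
                    "executionNode": "PRIMARY",
                    "properties": {
                        # illustrative keys -- confirm in your NiFi's docs
                        "capture-change-mysql-hosts": "mysql-host:3306",
                        "capture-change-mysql-db-name-pattern": schema_name,
                    },
                },
            },
        }
        resp = requests.post(
            f"{NIFI_API}/process-groups/{PG_ID}/processors", json=payload
        )
        resp.raise_for_status()
        return resp.json()["id"]

    # Example: provision processors for newly discovered schemas
    for i, schema in enumerate(["schema1", "schema2", "schema3"]):
        proc_id = provision_cdc_processor(schema, i * 220)
        print(f"created {proc_id} for {schema}")

A script like this could run on a schedule, diff the schemas visible in MySQL
against the processors already in the flow, and create (and start) only the
missing ones, which takes most of the manual toil out of option 2.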
