Ashwin,

You should have no problem here in terms of MergeContent scaling. But there are a few things that you'll want to consider:
1. If you have a cluster of these, you're going to be merging FlowFiles per node. So you'll need to ensure that you have enough data on each node to reach your maximums of 500 and 200 entries. And you'll want to ensure that you have a timeout (Max Bin Age) set in case a bin doesn't fill exactly; see the sketch at the bottom of this message.

2. The first MergeContent will need to hold (in Java heap) 500 FlowFiles per bin * 2000 bins = 1 million FlowFiles. The second MergeContent will hold up to 200 FlowFiles per bin * 2000 bins = 400,000 more. So you'll need a decent-sized heap to avoid an OutOfMemoryError (probably 8 GB; there's a bootstrap.conf sketch at the bottom of this message as well). If you still have issues with Java heap, you may even want to consider a chain of three MergeContent processors instead: a first that bins 100 msgs, a second that bins 100 msgs, and a third that bins 10 msgs. That yields the same 100,000 msgs per output (100 * 100 * 10 = 500 * 200), but the first stage then only holds 100 FlowFiles per bin * 2000 bins = 200,000 FlowFiles in heap instead of 1 million. This may not be necessary, but it's something to consider if you want to avoid OOME and don't want to, or can't, allocate a huge Java heap.

The good news here is that the amount of data you're actually merging is not that much; it's just a lot of FlowFiles. So MergeContent should be pretty efficient.

I hope this is helpful!

Thanks
-Mark

> On Sep 21, 2018, at 8:32 AM, ashwin konale <[email protected]> wrote:
>
> I have a nifi workflow to read from multiple mysql binlogs and put it to
> hdfs. I am using ChangeDataCapture as a source and PutHdfs as sink. I am
> using a MergeContent processor in between to chunk the messages together
> for hdfs.
>
> [CDC (Primary node only. About 200 of this processor, one for each db)]
> -> UpdateAttributes(db-table-hour)
> -> MergeContent 500 msg (bin based on db-table-hour)
> -> MergeContent 200 msg (bin based on db-table-hour)
> -> Put hdfs
>
> I have about ~200 databases to read from, and ~2500 tables altogether. The
> update rate for binlogs is around 1 Mbps per database. I am planning to
> run this on a 3-node nifi cluster as of now.
>
> Has anyone used MergeContent with more than 2000 bins before? Does it
> scale well? Can someone suggest any improvements to the workflow, or
> alternatives?
>
> Thanks
> Ashwin
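P.S. For concreteness, here is a rough sketch of how the two MergeContent processors might look. The property names are standard MergeContent properties, but the Max Bin Age and Maximum number of Bins values are only illustrative assumptions to tune for your latency needs and table count, and db-table-hour is whatever attribute your UpdateAttribute builds (for example, something like ${db.name}-${table.name}-${now():format('yyyyMMddHH')}, where db.name and table.name are hypothetical attribute names extracted from your CDC events):

    MergeContent #1
      Merge Strategy             : Bin-Packing Algorithm
      Correlation Attribute Name : db-table-hour
      Minimum Number of Entries  : 500
      Maximum Number of Entries  : 500
      Max Bin Age                : 5 min   (timeout so partially filled bins still flush)
      Maximum number of Bins     : 2500    (roughly one bin per active db-table-hour)

    MergeContent #2
      Merge Strategy             : Bin-Packing Algorithm
      Correlation Attribute Name : db-table-hour
      Minimum Number of Entries  : 200
      Maximum Number of Entries  : 200
      Max Bin Age                : 10 min
      Maximum number of Bins     : 2500

Setting Minimum and Maximum Number of Entries to the same value gives you fixed-size bundles, and Max Bin Age is the safety valve from point 1 above.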

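P.P.S. For the heap in point 2, a minimal sketch, assuming a stock conf/bootstrap.conf where java.arg.2 and java.arg.3 carry the JVM memory flags (8g is the rough estimate above, not a tested number):

    # conf/bootstrap.conf
    # These ship as 512m by default; raise both so the ~1.4 million
    # in-flight FlowFiles across the two merge stages fit in heap.
    java.arg.2=-Xms8g
    java.arg.3=-Xmx8g

Each node in the cluster needs this change, and NiFi has to be restarted for bootstrap.conf edits to take effect.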