Venkatesh,

Ok, so roughly 2.5GB/s of throughput then. That is a pretty significant volume. As you mentioned, much of it is text-oriented, so there is a good chance compression ratios will be quite considerable; the edge nodes could reduce it down to, say, 600MB/s overall. At that point you'd need perhaps a 5-6 node back-end cluster of decent capability, fed via site-to-site from a fairly large number of NiFi systems running at the edge, each capturing its slice of the pie, so to speak.
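For the record, here is the back-of-envelope arithmetic behind those numbers as a small Python sketch. The ~4:1 compression ratio and ~120MB/s sustained per-node rate are illustrative assumptions, not measured NiFi figures:

    # Back-of-envelope capacity estimate. Compression ratio and per-node
    # sustained throughput are assumptions for illustration only.
    GB = 1024 ** 3
    MB = 1024 ** 2

    raw_per_minute = 150 * GB                     # 150GB arriving per minute
    raw_per_second = raw_per_minute / 60          # ~2.5GB/s

    compression_ratio = 4.0                       # assumed ~4:1 on mostly-text data
    compressed_per_second = raw_per_second / compression_ratio   # ~640MB/s

    per_node_sustained = 120 * MB                 # assumed sustained rate per back-end node
    nodes_needed = compressed_per_second / per_node_sustained    # ~5.3

    print(f"raw: {raw_per_second / MB:.0f} MB/s")
    print(f"after edge compression: {compressed_per_second / MB:.0f} MB/s")
    print(f"back-end nodes before surge headroom: {nodes_needed:.1f}")

Add a node or two of headroom for arrival surges and you land in the 5-6 node range.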
Doable, but it certainly requires considerable design and tradeoff discussion.

Thanks
Joe

On Fri, Nov 20, 2015 at 10:18 AM, Venkatesh Sellappa
<[email protected]> wrote:
> @Joe : That's a great technique: receive from various sources, compress, and
> then send to a smaller cluster. (And yes, it is 150GB, not Gb.)
>
> @Mark : I love the idea of using S3 + SQS to spread the work of pulling the
> data across the cluster. The one hitch we might have is that there are
> companies that are very much against pushing data onto the cloud for
> regulatory and compliance reasons.
>
> Perhaps a section in the user guide around real-life usage scenarios would
> be useful to drive adoption.
>
>
> From: Mark Payne
> Sent: 20 November 2015 14:34
> To: [email protected]
> Subject: Re: Capacity Planning : Guidelines
>
>
> Venkatesh,
>
> I will note that recently, S3 has added the ability to register
> notifications via SQS whenever buckets are updated. So if you want to use S3
> in a scalable fashion, that is far more doable now. Generally, the idea is
> to configure S3 to send a notification via SQS, and then in NiFi have the
> nodes listen for these notifications with the GetSQS Processor. You can then
> extract the URL from the message via the EvaluateJsonPath Processor. Once
> you have this URL you can use FetchS3Object to pull the contents.
>
> This provides a nice mechanism for having a cluster of NiFi nodes receive
> updates to S3 while spreading the work of pulling the data in across the
> entire cluster.
>
> Thanks
> -Mark
>
>
>> On Nov 20, 2015, at 6:33 AM, Joe Witt <[email protected]> wrote:
>>
>> Hello
>>
>> So that is about 300MB/s if that really was 150Gb (not GB) per minute.
>> If you expect each node to be able to handle say 75MB/s of throughput
>> (which would be low, but I'm being conservative) then you'd need 4 or
>> so boxes to hit that rate. Then, assuming there will be surges in
>> arrival and lulls in processing, say 7-8 nodes. The other thing to
>> consider is that none of those protocols offers a scalable,
>> multi-node/queue-based exchange, so having the cluster all operate
>> efficiently may be non-trivial. In such a case you may be better off
>> having a few nodes that gather from a variety of predetermined
>> sources, compress the data, then fire it to a smaller central cluster.
>>
>> Anyway, there are lots of ways to tackle this, depending on what
>> resources you have available and the sorts of failure modes you can
>> accept versus those you cannot.
>>
>> Thanks
>> Joe
>>
>> On Fri, Nov 20, 2015 at 4:25 AM, Venkatesh Sellappa
>> <[email protected]> wrote:
>>> Are there any guidelines on how to scale NiFi up/down?
>>>
>>> (I know we don't do autoscaling at present, and nodes are independent of
>>> each other.)
>>>
>>> The use case is:
>>>
>>> 16,000 text files (CSV, XML, JSON) per minute, totalling 150Gb, are
>>> getting delivered onto a combination of FTP, S3, local filesystem, etc.
>>> sources.
>>>
>>> These files are then ingested, with some light processing, onto an HDFS
>>> cluster.
>>>
>>> My question is: are there any best practices, guidelines, or ideas on
>>> setting up a NiFi cluster for this kind of volume and throughput?
>>>
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-nifi-developer-list.39713.n7.nabble.com/Capacity-Planning-Guidelines-tp5142.html
>>> Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.
>
>
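For anyone curious about the shape of the S3 + SQS pattern Mark describes above, here is a rough Python/boto3 sketch of roughly what the GetSQS, EvaluateJsonPath, and FetchS3Object chain amounts to. The queue URL and downstream handling are placeholders; in an actual flow the NiFi processors do this work:

    # Sketch only: poll SQS for S3 event notifications, extract bucket/key,
    # then fetch the new object. Queue URL below is a placeholder.
    import json
    import boto3

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")
    queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-events"  # placeholder

    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10,
                               WaitTimeSeconds=10)          # like GetSQS
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        # S3 event notifications carry bucket and key per new object
        # (what EvaluateJsonPath would pull out in NiFi).
        for record in body.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            obj = s3.get_object(Bucket=bucket, Key=key)      # like FetchS3Object
            data = obj["Body"].read()
            # ... hand data to downstream processing ...
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])

Because every node in the cluster can poll the same queue, the fetch work spreads naturally across the cluster, which is the point of the pattern.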
