Hi, Selina,

As Gian mentioned, the first steps in setting up a real-time stream processing environment are to:
a) set up a Kafka cluster;
b) set up a YARN cluster.

The following links may get you started:
https://www.linkedin.com/pulse/20140813032057-89781742-deploy-kafka-cluster-on-aws
http://blog.c2b2.co.uk/2014/05/hadoop-v2-overview-and-cluster-setup-on.html
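
Once both clusters are up, the Samza job config has to point at them. Below is a rough Python sketch, not an official tool, that renders a minimal job .properties file using the YARN job factory and the Kafka system factory. The helper name samza_job_properties and all hostnames, paths, and class names in the usage part are hypothetical placeholders. A real job will likely need more settings (serdes, container counts, and so on), so treat this as a starting point only.

```python
def samza_job_properties(job_name, task_class, input_topic,
                         zookeeper, kafka_brokers, package_path):
    """Render a minimal Samza job config as .properties text.

    Hypothetical helper for illustration; the argument values are supplied
    by the caller and are not taken from any real deployment.
    """
    props = {
        # Submit the job to the YARN cluster from step (b).
        "job.factory.class": "org.apache.samza.job.yarn.YarnJobFactory",
        "job.name": job_name,
        # Location of the job tarball that YARN containers will download.
        "yarn.package.path": package_path,
        "task.class": task_class,
        # Inputs are written as <system>.<stream>; "kafka" is the system
        # name defined by the systems.kafka.* keys below.
        "task.inputs": "kafka." + input_topic,
        # Read from and write to the Kafka cluster from step (a).
        "systems.kafka.samza.factory":
            "org.apache.samza.system.kafka.KafkaSystemFactory",
        "systems.kafka.consumer.zookeeper.connect": ",".join(zookeeper),
        "systems.kafka.producer.bootstrap.servers": ",".join(kafka_brokers),
    }
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


if __name__ == "__main__":
    # All values below are placeholders (assumptions), not real endpoints.
    print(samza_job_properties(
        job_name="demo-job",
        task_class="com.example.DemoStreamTask",
        input_topic="events",
        zookeeper=["zk1.example.internal:2181"],
        kafka_brokers=["kafka1.example.internal:9092",
                       "kafka2.example.internal:9092"],
        package_path="hdfs://namenode.example.internal/jobs/demo-job.tar.gz",
    ))
```

If you run a separate Kafka cluster just for Samza, as Gian describes, the ZooKeeper and broker lists here would be the ones for that dedicated cluster.
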
-Yi

On Tue, Aug 4, 2015 at 5:58 PM, Job-Selina Wu <swucaree...@gmail.com> wrote:

> Dear All:
>
> I was looking for the tutorial how to build and run Samza on AWS and
> then I found a link below. I am wondering if there is a detail tutorial
> about how to build Samza on AWS?
>
> Sincerely,
> Selina
>
> https://cwiki.apache.org/confluence/display/SAMZA/FAQ#FAQ-HowshouldSamzaberunonAWS?
>
> How should Samza be run on AWS?
>
> From Gian Merlino:
>
> - We've been using Samza in production on AWS for a little over a
>   month. We're just using the YARN runner on a mostly stock hadoop 2.4.0
>   cluster (not EMR). Our experience is that c3s work well for the YARN
>   instances and i2s work well for the Kafka instances. Things have been
>   pretty solid with that setup. For scaling up and scaling down YARN, we
>   just terminate instances or add instances, and this works pretty well.
>   It can take a few minutes for the cluster to realize a node has gone
>   and respawn containers elsewhere. We have a separate Kafka cluster
>   just for Samza's use, different from our main Kafka cluster. The main
>   reason is that we wanted to isolate off the disk and network load of
>   state compactions and restores (we don't use compacted topics in our
>   main Kafka cluster, but we do use them with Samza, and the extra load
>   on Kafka can be substantial).