I don't know of any tutorials, but the order to tackle things would be:

1) Set up ZK. This could be a single-node install for a PoC, or a 3- or 5-node install for production. m3.medium is a reasonable node type.

2) Set up Kafka. This could be a single instance without replication for a PoC. For production, run as many brokers as you need, and you'd probably want replication. I think if you want to use local instance storage, i2 instances are good, and if you want to use EBS, probably m3 instances.

3) Set up YARN. This could be a single instance (running pseudo-distributed, with master and slave on the same machine) or two instances (one master, one slave) for a PoC. I think c3 or r3 instance types are good for the slaves, depending on how much memory you need. Workloads without large amounts of state should be OK on c3 instances.

EMR might actually work for YARN if you use the long-running kind of cluster (see: http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-longrunning-transient.html). I haven't tried that, but it might be worth a shot before going for stock Apache Hadoop.

On Tue, Aug 4, 2015 at 5:58 PM, Job-Selina Wu <swucaree...@gmail.com> wrote:

> Dear All:
>
> I was looking for the tutorial how to build and run Samza on
> AWS and then I found a link below. I am wondering if there is a detail
> tutorial about how to build Samza on AWS?
>
> Sincerely,
> Selina
>
> https://cwiki.apache.org/confluence/display/SAMZA/FAQ#FAQ-HowshouldSamzaberunonAWS?
>
> How should Samza be run on AWS?
>
> From Gian Merlino:
>
> - We've been using Samza in production on AWS for a little over a
> month. We're just using the YARN runner on a mostly stock hadoop 2.4.0
> cluster (not EMR). Our experience is that c3s work well for the YARN
> instances and i2s work well for the Kafka instances. Things have been
> pretty solid with that setup. For scaling up and scaling down YARN, we
> just terminate instances or add instances, and this works pretty well.
> It can take a few minutes for the cluster to realize a node has gone
> and respawn containers elsewhere. We have a separate Kafka cluster just
> for Samza's use, different from our main Kafka cluster.
> The main reason is that we wanted
> to isolate off the disk and network load of state compactions and
> restores (we don't use compacted topics in our main Kafka cluster, but
> we do use them with Samza, and the extra load on Kafka can be
> substantial).
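
For the ZooKeeper step, a minimal sketch of a three-node ensemble config (zoo.cfg) might look like the following. This is only an illustration under stated assumptions: the hostnames and the data directory are placeholders that do not come from this thread, and the timing values are the commonly used sample defaults rather than tuned recommendations.

```properties
# zoo.cfg: illustrative three-node ensemble (hostnames are hypothetical)
tickTime=2000
initLimit=10
syncLimit=5

# Placeholder path; each node also needs a "myid" file in this directory
# containing its own server number (1, 2, or 3).
dataDir=/var/lib/zookeeper
clientPort=2181

# server.<id>=<host>:<peer port>:<leader election port>
server.1=zk1.example.internal:2888:3888
server.2=zk2.example.internal:2888:3888
server.3=zk3.example.internal:2888:3888
```

With an ensemble like this, the Kafka brokers would typically point at it through the `zookeeper.connect` setting in server.properties, for example `zk1.example.internal:2181,zk2.example.internal:2181,zk3.example.internal:2181`. For a single-node PoC, the three `server.N` lines can simply be left out.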