I wrote several Ansible playbooks to deploy YARN (without HDFS), ZooKeeper, and Kafka to EC2 for running Samza jobs. If you know Ansible, those scripts may be helpful. You can find them at https://github.iu.edu/mpathira/samza-ec2-ansible. I was planning to add a document describing these scripts but have not been able to do it yet. I also looked at EMR, but as I remember, the EMR job deployment model doesn't work with the current scripts provided by Samza.
I used R3 instances for Kafka and C3 instances for YARN. As I remember, I could get close to 1 million msg/s with a 3-node Kafka cluster running on r3.xlarge instances and a 2 (or 4) node YARN cluster running 4 stream tasks per job.

Thanks
Milinda

On Wed, Aug 5, 2015 at 11:27 AM, Gian Merlino <gianmerl...@gmail.com> wrote:

> I don't know of any tutorials, but the order to tackle things would be:
>
> 1) Set up ZK- this could be a single node install for a PoC or a 3 or 5
> node install for production. m3.medium is a reasonable node type.
>
> 2) Set up Kafka- could be a single instance without replication for a PoC.
> For production, as many as you need, and you'd probably want replication. I
> think if you want to use local instance storage, i2 instances are good, and
> if you want to use EBS, probably m3 instances.
>
> 3) Set up YARN- this could be a single instance (running pseudo-distributed
> with master & slave on the same machine) or two instances (one master, one
> slave) for a PoC. I think c3 or r3 instance types are good for the slaves,
> depending on how much memory you need. Workloads without large amounts of
> state should be ok on c3 instances.
>
> EMR might actually work for YARN if you use the long-running kind of
> cluster (see:
> http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-longrunning-transient.html
> ).
> I haven't tried that, but it might be worth a shot before going for stock
> apache hadoop.
>
> On Tue, Aug 4, 2015 at 5:58 PM, Job-Selina Wu <swucaree...@gmail.com>
> wrote:
>
> > Dear All: I was looking for the tutorial how to build and run Samza on
> > AWS and then I found a link below. I am wondering if there is a detail
> > tutorial about how to build Samza on AWS?
> >
> > Sincerely,
> > Selina
> >
> >
> > https://cwiki.apache.org/confluence/display/SAMZA/FAQ#FAQ-HowshouldSamzaberunonAWS
> > ?
> > How should Samza be run on AWS?
> >
> > From Gian Merlino:
> >
> > - We've been using Samza in production on AWS for a little over a
> > month. We're just using the YARN runner on a mostly stock hadoop
> > 2.4.0 cluster (not EMR). Our experience is that c3s work well for
> > the YARN instances and i2s work well for the Kafka instances.
> > Things have been pretty solid with that setup. For scaling up and
> > scaling down YARN, we just terminate instances or add instances,
> > and this works pretty well. It can take a few minutes for the
> > cluster to realize a node has gone and respawn containers
> > elsewhere. We have a separate Kafka cluster just for Samza's use,
> > different from our main Kafka cluster. The main reason is that we
> > wanted to isolate off the disk and network load of state
> > compactions and restores (we don't use compacted topics in our
> > main Kafka cluster, but we do use them with Samza, and the extra
> > load on Kafka can be substantial).
> >

--
Milinda Pathirage

PhD Student | Research Assistant
School of Informatics and Computing | Data to Insight Center
Indiana University

twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org