I wrote several Ansible playbooks to deploy YARN (without HDFS), ZooKeeper,
and Kafka to EC2 for running Samza jobs. If you know Ansible, those
playbooks may be helpful. You can find them at
https://github.iu.edu/mpathira/samza-ec2-ansible. I was planning to add
documentation describing them but haven't been able to do it yet. I also
looked at EMR, but as I remember, the EMR job deployment model doesn't work
with the current scripts provided by Samza.

I used R3 instances for Kafka and C3 instances for YARN. As I remember, I
could get close to 1 million msg/s with a 3-node Kafka cluster running on
r3.xlarge instances and a 2 (or 4) node YARN cluster running 4 stream tasks
per job.
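
For a rough picture of that layout, here is a minimal Python sketch. It is
not part of the playbooks: the names ClusterSpec, PLAN and per_broker_rate
are hypothetical, the c3.xlarge size is an assumption (only the C3 family is
stated above), and the ZooKeeper tier follows the m3.medium suggestion in
the quoted reply below. The per-broker figure is simple division of the
approximate 1 million msg/s total, not a measured number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical description of one tier of the deployment."""
    service: str
    instance_type: str
    nodes: int


# Setup order suggested in this thread: ZooKeeper, then Kafka, then YARN.
# Kafka (3 x r3.xlarge) and the 2-node YARN cluster come from the message
# above; the c3.xlarge size and the 3-node ZooKeeper tier are assumptions.
PLAN = [
    ClusterSpec("zookeeper", "m3.medium", 3),
    ClusterSpec("kafka", "r3.xlarge", 3),
    ClusterSpec("yarn", "c3.xlarge", 2),
]


def per_broker_rate(total_msgs_per_sec: float, brokers: int) -> float:
    """Average message rate each broker handles if load is spread evenly."""
    if brokers <= 0:
        raise ValueError("brokers must be positive")
    return total_msgs_per_sec / brokers


if __name__ == "__main__":
    for step, spec in enumerate(PLAN, start=1):
        print(f"{step}) {spec.service}: {spec.nodes} x {spec.instance_type}")
    kafka = next(s for s in PLAN if s.service == "kafka")
    rate = per_broker_rate(1_000_000, kafka.nodes)
    print(f"roughly {rate:,.0f} msg/s per Kafka broker")
```

The even split is only a back-of-the-envelope estimate; real per-broker load
depends on how partitions and their leaders are distributed.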

Thanks
Milinda

On Wed, Aug 5, 2015 at 11:27 AM, Gian Merlino <gianmerl...@gmail.com> wrote:

> I don't know of any tutorials, but the order to tackle things would be:
>
> 1) Set up ZK- this could be a single node install for a PoC or a 3 or 5
> node install for production. m3.medium is a reasonable node type.
>
> 2) Set up Kafka- could be a single instance without replication for a PoC.
> For production, as many as you need, and you'd probably want replication. I
> think if you want to use local instance storage, i2 instances are good, and
> if you want to use EBS, probably m3 instances.
>
> 3) Set up YARN- this could be a single instance (running pseudo-distributed
> with master & slave on the same machine) or two instances (one master, one
> slave) for a PoC. I think c3 or r3 instance types are good for the slaves,
> depending on how much memory you need. Workloads without large amounts of
> state should be ok on c3 instances.
>
> EMR might actually work for YARN if you use the long-running kind of
> cluster (see:
>
> http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-longrunning-transient.html
> ).
> I haven't tried that, but it might be worth a shot before going for stock
> apache hadoop.
>
> On Tue, Aug 4, 2015 at 5:58 PM, Job-Selina Wu <swucaree...@gmail.com>
> wrote:
>
> > Dear All: I was looking for a tutorial on how to build and run Samza on
> > AWS and then I found the link below. I am wondering if there is a detailed
> > tutorial about how to build Samza on AWS?
> >
> > Sincerely,
> > Selina
> >
> >
> >
> > https://cwiki.apache.org/confluence/display/SAMZA/FAQ#FAQ-HowshouldSamzaberunonAWS?
> > How should Samza be run on AWS?
> >
> > From Gian Merlino:
> >
> >    - We've been using Samza in production on AWS for a little over a
> >    month. We're just using the YARN runner on a mostly stock hadoop 2.4.0
> >    cluster (not EMR). Our experience is that c3s work well for the YARN
> >    instances and i2s work well for the Kafka instances. Things have been
> >    pretty solid with that setup. For scaling up and scaling down YARN, we
> >    just terminate instances or add instances, and this works pretty well.
> >    It can take a few minutes for the cluster to realize a node has gone
> >    and respawn containers elsewhere. We have a separate Kafka cluster just
> >    for Samza's use, different from our main Kafka cluster. The main reason
> >    is that we wanted to isolate off the disk and network load of state
> >    compactions and restores (we don't use compacted topics in our main
> >    Kafka cluster, but we do use them with Samza, and the extra load on
> >    Kafka can be substantial).
> >
>



-- 
Milinda Pathirage

PhD Student | Research Assistant
School of Informatics and Computing | Data to Insight Center
Indiana University

twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org
