Hi Osh,

I've not run Samza on EC2 myself but have had numerous other workloads there.

I'm not surprised you find conflicting advice on these topics; hardware 
selection is a bit of a dark art and on EC2 even more so. For every recommended 
configuration that works for one person you'll find somebody for whom the exact 
same config almost destroyed their business. :)

If at all possible, I'd suggest standing up the config you mentioned and trying 
it on as realistic a sample of data as you'll see in production. Particularly 
for instance types and counts, this is the only data point that is actually 
guaranteed to be valuable.

Your implicit assumption that ZK is likely to be the most sensitive to EC2 
weirdness is almost certainly true. It may be worth going through the ZooKeeper 
wiki and mailing list archives for any relevant best practices. It's likely not 
a huge concern if your initial data rates are also low (you only mentioned 
volumes), but ZK can get into a pretty unhappy state if it starts seeing spikes 
in latency to its storage or between nodes in the ensemble.
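
As a rough way to keep an eye on this, here is a minimal sketch (not something 
I've run against your setup) that polls a ZooKeeper node using the "mntr" 
four-letter command and flags high reported latency. The function names, the 
100 ms threshold and the default port 2181 are assumptions for illustration, 
and newer ZooKeeper releases may need the command explicitly enabled.

```python
import socket


def parse_mntr(text):
    """Parse the tab-separated 'key<TAB>value' lines of a mntr reply."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:
            stats[key.strip()] = value.strip()
    return stats


def fetch_mntr(host, port=2181, timeout=5.0):
    """Send 'mntr' to a ZooKeeper server and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"mntr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def latency_warning(stats, max_ms=100):
    """True if the reported max latency exceeds max_ms (threshold is a guess)."""
    return float(stats.get("zk_max_latency", 0)) > max_ms
```

You would call something like latency_warning(parse_mntr(fetch_mntr(host))) 
for each node in the ensemble from cron or a monitoring agent, and tune the 
threshold once you know what normal looks like on your instances.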

One thing I would consider is running the YARN NM on all 3 hosts. The YARN RM 
is relatively lightly used, so by keeping the NM off that host you are 
effectively limiting yourself to only 2 nodes for actual stream task processing.

Please feed back any experiences you have with Samza on EC2 as I suspect this 
will become a FAQ entry at some point once we have more experience. There's a 
desire to more directly support EC2 as a work scheduler but that's purely 
speculative at this point.

Good luck!
Garry

-----Original Message-----
From: Oshoma Momoh [mailto:[email protected]] 
Sent: 24 April 2014 20:19
To: [email protected]
Subject: Getting started with Samza on Amazon EC2

Hi all,

I am setting up a Samza cluster for the first time, and am now at the point of 
deploying on EC2.  Hopefully this is the correct place to ask a few newbie 
questions. I'm impressed and excited by what I've seen so far, eager to get 
going with a real deployment.

1. Does anyone have good or bad experiences to report in running Samza atop 
Ubuntu 14.04 LTS? (Versus 12.04.)

2. Any best practices to recommend in terms of setup on EC2? E.g. instance 
types to use, EBS volumes versus non-EBS, and so on.  I've found several 
threads with conflicting opinions on all of this. Our current plan is...
(a) Use EBS volumes, separating Zookeeper from Kafka.
(b) Start with three m3.large instances and upgrade later as needed, since our 
initial data volume will be low.
(c) Kafka + Zookeeper + Yarn Node Manager on two worker nodes, and Kafka + 
Zookeeper + Yarn Resource Manager on the third node.

Regards,

osh

Oshoma Momoh
http://pcglab.com
