Hi Osh,

I've not run Samza on EC2 myself but have had numerous other workloads there.
I'm not surprised you find conflicting advice on these topics; hardware selection is a bit of a dark art and on EC2 even more so. For every recommended configuration that works for one person you'll find somebody for whom the exact same config almost destroyed their business. :)

If at all possible I'd suggest standing up the config you mentioned and trying it on as realistic a sample of data as you'll see in production. Particularly in terms of instance types and numbers, this is the only data point that is actually guaranteed to be valuable.

Your implicit assumption that ZK is likely to be the most sensitive to EC2 weirdness is almost certainly true. It may be worth going through the ZooKeeper wiki and mailing list archives for any relevant best practice. This is likely not a huge concern if your initial data rates are also low (you only mentioned volumes), but ZK can get into a pretty unhappy state if it starts seeing spikes in latency to the storage or between nodes in the ensemble.

One thing I would consider is to run the YARN NM on all 3 hosts -- the YARN RM is relatively lightly used, and with the NM on only two hosts you are effectively limiting yourself to only 2 nodes for actual stream task processing.

Please feed back any experiences you have with Samza on EC2, as I suspect this will become a FAQ entry at some point once we have more experience. There's a desire to more directly support EC2 as a work scheduler but that's purely speculative at this point.

Good luck!
Garry

-----Original Message-----
From: Oshoma Momoh [mailto:[email protected]]
Sent: 24 April 2014 20:19
To: [email protected]
Subject: Getting started with Samza on Amazon EC2

Hi all,

I am setting up a Samza cluster for the first time, and am now at the point of deploying on EC2. Hopefully this is the correct place to ask a few newbie questions. I'm impressed and excited by what I've seen so far, eager to get going with a real deployment.

1. Does anyone have good or bad experiences to report in running Samza atop Ubuntu 14.04 LTS? (Versus 12.04.)

2. Any best practices to recommend in terms of setup on EC2? E.g. instance types to use, EBS volumes versus non-EBS, and so on. I've found several threads with conflicting opinions on all of this. Our current plan is...

(a) Use EBS volumes, separating Zookeeper from Kafka.
(b) Start with three m3.large instances and upgrade later as needed, since our initial data volume will be low.
(c) Kafka + Zookeeper + Yarn Node Manager on two worker nodes, and Kafka + Zookeeper + Yarn Resource Manager on the third node.

Regards,
osh

Oshoma Momoh
http://pcglab.com
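
On the point in the reply about ZK getting unhappy when it sees latency spikes to storage: one rough way to get a feel for a volume before putting ZK on it is to time small synced writes, since ZooKeeper syncs its transaction log to disk as part of handling writes. The sketch below is only an illustration of that idea and not anything taken from the Samza or ZooKeeper docs. The function names (fsync_latencies_ms, summarize), the 4 KB payload and the sample count are all made up for this example, and the directory it probes is an assumption you would swap for wherever the ZK log would actually live.

```python
import os
import statistics
import tempfile
import time


def fsync_latencies_ms(directory, samples=50, payload=b"x" * 4096):
    """Append a small payload and fsync it `samples` times.

    Returns the latency of each write+fsync pair in milliseconds.
    """
    latencies = []
    # Scratch file in the target directory; removed again when done.
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # force the write through to the device
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies


def summarize(latencies):
    """Return median, approximate 99th percentile and max of the samples."""
    ordered = sorted(latencies)
    p99_index = min(len(ordered) - 1, int(0.99 * len(ordered)))
    return {
        "median_ms": statistics.median(ordered),
        "p99_ms": ordered[p99_index],
        "max_ms": ordered[-1],
    }


if __name__ == "__main__":
    # Assumption: replace the temp dir with the mount point of the
    # volume you intend to give to ZK.
    print(summarize(fsync_latencies_ms(tempfile.gettempdir())))
```

Running it a few times at different times of day on the candidate volume, and looking at the tail (p99 and max) rather than the median, is probably more telling than any single run. It is no substitute for testing the real stack on realistic data.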
