I've added the above recommendations along with others to our Operations wiki - https://cwiki.apache.org/confluence/display/KAFKA/Operations#Operations-Zookeeper

Thanks,
Neha

On Tue, Aug 7, 2012 at 4:14 PM, Joel Koshy <jjkosh...@gmail.com> wrote:

> Here are some comments from Dave. He'll be adding some more details to a
> blog post that we can link from that wiki.
>
> For the most part, yes, it is pretty much the obvious, but here's the
> short version of some longer details that I really should get into the
> wiki that you pointed at:
> - Redundancy in the physical/hardware/network layout: try not to put
>   them all in the same rack, use decent (but don't go nuts) hardware,
>   try to keep redundant power and network paths, etc.
> - I/O segregation: if you do a lot of write-type traffic you'll almost
>   definitely want the transaction logs on a different disk group than
>   app logs and snapshots (a write to the zookeeper service has a
>   synchronous write to disk, which can be slow).
> - Application segregation: unless you really understand the application
>   patterns of other apps that you want to install on the same box, it
>   can be a good idea to run zookeeper in isolation (though this can be a
>   balancing act with the capabilities of the hardware).
> - Use care with virtualization: it can work, depending on your cluster
>   layout, read/write patterns, and SLAs, but the tiny overheads
>   introduced by the virtualization layer can add up and throw off
>   zookeeper, as it can be very time sensitive.
> - Zookeeper configuration and monitoring: it's Java, so make sure you
>   give it 'enough' heap space (I usually run them with 3-5G, but that's
>   mostly due to the data set size we have here). Unfortunately I don't
>   have a good formula for it. As far as monitoring, both JMX and the
>   4 letter commands are very useful. They do overlap in some cases (and
>   in those cases I prefer the 4 letter commands; they seem more
>   predictable, or at the very least, they work better with the LI
>   monitoring infrastructure).
> - Don't overbuild the cluster: large clusters, especially in a
>   write-heavy usage pattern, mean a lot of intra-cluster communication
>   (quorums on the writes and subsequent cluster member updates). But
>   don't underbuild it either (and risk swamping the cluster).
>
> Overall, I try to keep the zookeeper system as small as will handle the
> load (plus standard growth capacity planning) and as simple as possible.
> I try not to do anything fancy with the configuration or application
> layout as compared to the official release, and I keep it as self
> contained as possible. For these reasons, I tend to skip the OS
> packaged versions, since they have a tendency to try to put things in
> the OS standard hierarchy, which can be 'messy', for want of a better
> way to word it.
>
>
> On Tue, Aug 7, 2012 at 12:00 PM, James A. Robinson <
> jim.robin...@stanford.edu> wrote:
>
>> Hi folks,
>>
>> The operations wiki page
>>
>>   https://cwiki.apache.org/confluence/display/KAFKA/Operations
>>
>> states, in part
>>
>>   Zookeeper
>>
>>   Zookeeper is essential for the correct operation of Kafka. There are
>>   a number of things that must be done to keep zookeeper running
>>   happily as we have learned the hard way, hopefully Dave and Neha
>>   will add this since I don't know what we did.
>>
>> I was wondering if anyone on the list had comments on this topic?
>>
>> Were there things beyond what might be considered obvious, e.g.,
>> running at least five nodes on separate machines w/ redundant network
>> paths, and so forth?
>>
>>
>> Jim
>>
>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>> James A. Robinson                       jim.robin...@stanford.edu
>> Stanford University HighWire Press      http://highwire.stanford.edu/
>> +1 650 7237294 (Work)                   +1 650 7259335 (Fax)
>>
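
For anyone who wants to try the 4 letter commands from Dave's notes, here is a rough Python sketch rather than anything official. It assumes a ZooKeeper server reachable on the default client port 2181, that `ruok` answers `imok`, and that `mntr` (available from ZooKeeper 3.4.0) returns one tab-separated key/value pair per line. The function names `four_letter`, `parse_mntr`, and `is_healthy` are made up for illustration, and newer ZooKeeper releases may need the commands whitelisted before they respond.

```python
import socket


def four_letter(host: str, port: int, cmd: str, timeout: float = 5.0) -> str:
    """Send a four-letter command (e.g. "ruok", "stat", "mntr") and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection once it has replied
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(reply: str) -> dict:
    """Turn `mntr` output into a dict; lines without a tab are skipped."""
    stats = {}
    for line in reply.splitlines():
        key, sep, value = line.partition("\t")
        if sep:
            stats[key.strip()] = value.strip()
    return stats


def is_healthy(host: str, port: int = 2181) -> bool:
    """Return True if the server answers "ruok" with "imok" (assumed default port)."""
    try:
        return four_letter(host, port, "ruok").strip() == "imok"
    except OSError:  # refused connection, timeout, DNS failure, ...
        return False
```

Something like `parse_mntr(four_letter("zk1.example.com", 2181, "mntr"))` (hostname hypothetical) would give a dict that a monitoring script could poll and alert on; which keys to watch depends on your own load and SLAs.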