The DataNode is not especially memory-intensive (you want to leave
memory to the OS so that it can do fs caching). ZK is recommended to
have 1GB, so I'd move the .5 from DN to ZK.
Otherwise that looks reasonable.
Fernando Padilla wrote:
thank you!
I'll pay attention to the CPU load then. Any tips about the memory
distribution? This is what I'm expecting, but I'm a newb. :)
DataNode - 1.5G
TaskTracker - .5G
Zookeeper - .5G
RegionServer - 2G
M/R - 2G
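As a quick sanity check on the numbers above, here is a minimal sketch that totals the proposed per-node budget and applies the adjustment suggested in the reply at the top of this thread (move .5G from the DataNode to ZooKeeper). The `rebalance` helper and the variable names are made up for illustration; the 6.5G figure is the usable memory estimated elsewhere in this thread for a medium instance.

```python
# Per-node memory budget in GB, copied from the proposal above.
proposed = {
    "DataNode": 1.5,
    "TaskTracker": 0.5,
    "ZooKeeper": 0.5,
    "RegionServer": 2.0,
    "M/R": 2.0,
}

# Assumption from this thread: about 7.5G on the instance, about 6.5G usable.
USABLE_GB = 6.5


def rebalance(budget, src, dst, amount_gb):
    """Return a copy of the budget with amount_gb moved from src to dst."""
    adjusted = dict(budget)
    adjusted[src] -= amount_gb
    adjusted[dst] += amount_gb
    return adjusted


# Suggested tweak: ZooKeeper gets 1G, taken from the DataNode.
adjusted = rebalance(proposed, "DataNode", "ZooKeeper", 0.5)

for name, gb in adjusted.items():
    print(f"{name:14s} {gb:.1f}G")
print(f"total {sum(adjusted.values()):.1f}G of {USABLE_GB:.1f}G usable")
```

The total stays the same after the move; only the split between the DataNode and ZooKeeper changes.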
Jonathan Gray wrote:
IMO, you can fit those things into 6.5G without a problem. Of course,
the more you give it the better your performance.
However, medium instances have only 2 cores... That's going to be a
problem. Under heavy load (especially in an upload/import situation)
you will starve threads in at least one of these processes... At a
minimum, you really want a core each for DN, ZK, RS and then your
requirements for your MR tasks would depend on the nature of them. If
they are at all CPU intensive, then you need to be sure to dedicate
sufficient resources to them.
In general, we recommend XL instances because they are quad core.
Otherwise you will likely run into issues with this many processes on
two cores.
JG
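The core-count argument above can be sketched as simple arithmetic. This is only an illustration: the `cores_short` helper is hypothetical, and counting one core per concurrent M/R task is an assumption for CPU-heavy jobs, not a rule stated in this thread.

```python
# Daemons that should each get a core, per the message above.
DAEMONS = ["DataNode", "ZooKeeper", "RegionServer"]


def cores_short(available_cores, mr_task_cores=0):
    """Return how many cores the node is short of the 'one core each' minimum.

    mr_task_cores is an assumed number of cores reserved for M/R tasks.
    """
    wanted = len(DAEMONS) + mr_task_cores
    return max(0, wanted - available_cores)


print("medium (2 cores), no M/R:", cores_short(2))
print("XL (4 cores), no M/R:", cores_short(4))
print("XL (4 cores), 1 core for M/R:", cores_short(4, mr_task_cores=1))
```

On two cores the three daemons alone already oversubscribe the CPU, which is the starvation risk described above; four cores leave room for one core of M/R work.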
Fernando Padilla wrote:
OK, if you don't mind me stretching this simple conversation a bit
more..
Say I use the medium ec2 instance.. that's about 7.5G of ram, so I
have about 6.5 total.
On any one node I would have:
DataNode
TaskTracker
Zookeeper
RegionServer
+Map/Reduce Tasks?
What would your gut be for distributing the memory?
Can I run my M/R Tasks all sharing one JVM to share the same memory,
or does each Map or Reduce have its own JVM/Memory requirements?
I'm thinking between 5 to 10 nodes. I know that this seems stingy
for what you guys are used to.. but this is my worst case or minimum
allocation.. if need be I can plan to get more nodes and spread
around the load (bursting on heavy days, etc).. but I don't want to
plan/budget for a large number of nodes until we see good ROI, etc
etc etc..
On 7/14/09 11:54 PM, Nitay wrote:
Yes, Ryan's right. While we recommend running ZooKeeper on separate hosts,
it is really only if you can afford to do so. Otherwise, choose some of your
region server machines and run ZooKeeper alongside those.
On Tue, Jul 14, 2009 at 10:34 PM, Ryan Rawson wrote:
You can probably host it all on one set of machines. You'll need the
large sized.
Let us know how EC2 works, performance might be off due to the
virtualization.
On Tue, Jul 14, 2009 at 10:32 PM, Fernando Padilla wrote:
The reason I ask, is that I'm planning on setting up a small HBase cluster
in ec2..
having 3 to 5 instances just for zookeeper, while having only 3 to 5
instances for HBase.. it sounds lop-sided. :)
Does anyone here have any experience with HBase in EC2?
Ryan Rawson wrote:
I run my ZK quorum on my regionservers, but I also have 16 GB ram per
regionserver. I used to run 1gb, and never had problems. Now with
hbase managing the quorum I have 5gb ram, and it's probably overkill
but better safe than sorry.
On Tue, Jul 14, 2009 at 6:07 PM, Nitay wrote:
Hi Fernando,
It is recommended that you run ZooKeeper separate from the Region Servers.
On the memory side, our use of ZooKeeper in terms of data stored is minimal
currently. However you definitely don't want it to swap and you want to be
able to handle a large number of connections. A safe value would be
something like 1GB.
-n
On Tue, Jul 14, 2009 at 2:58 PM, Fernando Padilla wrote:
So.. what's the recommendation for zookeeper?
should I run zookeeper nodes on the same region servers?
should I run zookeeper nodes external to the region servers?
how much memory should I give zookeeper, if it's just used for hbase?