On 10/08/2011 08:58, jagaran das wrote:
In my current project we are planning to stream data to the Namenode (20-node
cluster).
Data volume would be around 1 PB per day.
But there are applications which can publish data at 1 GBPS.

Is that gigabytes/s or gigabits/s?


Few queries:

1. Can a single Namenode handle such high-speed writes? Or does it become
unresponsive when a GC cycle kicks in?

see below

2. Can we have multiple federated Namenodes sharing the same slaves, so that
we can distribute the writes accordingly?

that won't solve your problem

3. Can multiple region servers of HBase help us ??

no


Please suggest how we can design the streaming part to handle such scale of 
data.

Data is written to datanodes, not namenodes. The NN is used to set up the write chain and then just tracks node health; the data does not go through it.

This changes your problem to one of:
- Can the NN set up write chains at the speed you want, or do you need to throttle back the file creation rate by writing bigger files?
- Can the NN handle the (file x block count) volumes you expect?
- What is the network traffic of the data ingress?
- What is the total bandwidth of the replication traffic combined with the data ingress traffic?
- Do you have enough disks for the data?
- Do your HDDs have enough bandwidth?
- Do you want to do any work with the data, and what CPU/HDD/network load does that generate?
- What impact will disk and datanode replication traffic have?
- How much of the backbone will you have to allocate to the rebalancer?
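To make those questions concrete, here is a back-of-envelope sketch of the first-order numbers. All inputs are illustrative assumptions, not measurements: 1 PB/day of ingress, HDFS replication factor 3, 128 MB blocks, and roughly 100 MB/s of sustained sequential write bandwidth per HDD.

```python
# Capacity estimate for the checklist above. Every constant here is an
# assumption for illustration; plug in your own cluster's figures.

PB = 10**15
MB = 10**6
SECONDS_PER_DAY = 24 * 3600

ingress_per_day = 1 * PB          # bytes of new data per day (assumed)
replication = 3                   # HDFS default replication factor
block_size = 128 * MB             # assumed block size
disk_bw = 100 * MB                # assumed sustained write bandwidth per HDD, bytes/s

# Raw ingress rate, and total write rate once replica traffic is included.
ingress_bps = ingress_per_day / SECONDS_PER_DAY
total_write_bps = ingress_bps * replication

# Minimum disk count just to absorb the writes -- no headroom for reads,
# compute, rebalancing, or failure recovery.
min_disks = total_write_bps / disk_bw

# NN metadata growth: unique new blocks per day (each block also carries
# `replication` replica locations in the block map).
blocks_per_day = ingress_per_day // block_size

print(f"ingress:       {ingress_bps / 1e9:.1f} GB/s")
print(f"with replicas: {total_write_bps / 1e9:.1f} GB/s")
print(f"min disks:     {min_disks:.0f}")
print(f"new blocks:    {blocks_per_day / 1e6:.1f} million/day")
```

Even under these generous assumptions you are writing tens of GB/s across the cluster and need hundreds of disks before you do any work with the data at all.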

At 1 PB/day, ignoring all network issues, you will reach the currently documented HDFS limits within four weeks. What are you going to do then, or will you have processed the data down by that point?
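A rough sketch of why that limit arrives so fast, assuming 128 MB blocks and a practical per-NN ceiling on the order of a few hundred million blocks (the exact figure depends on heap size and version; 300 million here is an assumed round number, not a documented constant):

```python
# Object-count growth at 1 PB/day. Both the block size and the NN ceiling
# are assumptions chosen for order-of-magnitude illustration.

MB = 10**6
PB = 10**15

block_size = 128 * MB
blocks_per_day = (1 * PB) // block_size   # unique new blocks per day
practical_limit = 300 * 10**6             # assumed per-NN block ceiling

days_to_limit = practical_limit / blocks_per_day
print(f"{blocks_per_day / 1e6:.1f} M new blocks/day "
      f"-> ceiling reached in ~{days_to_limit:.0f} days")
```

Smaller blocks or smaller files make it worse; bigger files and aggressive compaction of the raw data buy you time, which is why the file-size question above matters so much.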

I can imagine some experiments you could conduct against a namenode to see what its limits are, but there are a lot of datacentre bandwidth and computation details you have to worry about above and beyond datanode performance issues.

Like Michael says, 1 PB/day sounds like a homework project, especially if you haven't used Hadoop at smaller scale. If it is homework, once you've done the work (and submitted it), it'd be nice to see the final paper.

If it is something you plan to take live, well, there are lots of issues to address, of which the NN is just one, and one you can test in advance. Ramping up the cluster with different loads will teach you more about the bottlenecks. Otherwise: there are people who know how to run Hadoop at scale who, in exchange for money, will help you.

-steve
