On 10/08/2011 08:58, jagaran das wrote:
In my current project we are planning to stream data to the Namenode (20-node
cluster).
Data volume would be around 1 PB per day.
But there are applications which can publish data at 1 GBPS.

Is that gigabytes/s or gigabits/s?


Few queries:

1. Can a single Namenode handle such high-speed writes? Or does it become
unresponsive when a GC cycle kicks in?

see below

2. Can we have multiple federated Namenodes sharing the same slaves, so that
we can distribute the writes accordingly?

that won't solve your problem

3. Can multiple region servers of HBase help us ??

no


Please suggest how we can design the streaming part to handle such scale of 
data.

Data is written to datanodes, not namenodes. The NN is used to set up the write chain and then just tracks node health; the data does not go through it.

This changes your problem to one of:
- Can the NN set up write chains at the speed you want, or do you need to throttle back the file creation rate by writing bigger files?
- Can the NN handle the (file x block count) volumes you expect?
- What is the network traffic of the data ingress?
- What is the total bandwidth of the replication traffic combined with the data ingress traffic?
- Do you have enough disks for the data?
- Do your HDDs have enough bandwidth?
- Do you want to do any work with the data, and what CPU/HDD/network load does that generate?
- What impact will disk and datanode replication traffic have?
- How much of the backbone will you have to allocate to the rebalancer?
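To make those questions concrete, here is a back-of-envelope sketch of the first-order numbers. All inputs are illustrative assumptions, not measurements: 1 PB/day of ingress, HDFS replication factor 3, 128 MB blocks, and roughly 100 MB/s of sustained sequential write bandwidth per HDD.

```python
# Capacity estimate for the checklist above. Every constant here is an
# assumption for illustration; plug in your own cluster's figures.

PB = 10**15
MB = 10**6
SECONDS_PER_DAY = 24 * 3600

ingress_per_day = 1 * PB          # bytes of new data per day (assumed)
replication = 3                   # HDFS default replication factor
block_size = 128 * MB             # assumed block size
disk_bw = 100 * MB                # assumed sustained write bandwidth per HDD, bytes/s

# Raw ingress rate, and total write rate once replica traffic is included.
ingress_bps = ingress_per_day / SECONDS_PER_DAY
total_write_bps = ingress_bps * replication

# Minimum disk count just to absorb the writes -- no headroom for reads,
# compute, rebalancing, or failure recovery.
min_disks = total_write_bps / disk_bw

# NN metadata growth: unique new blocks per day (each block also carries
# `replication` replica locations in the block map).
blocks_per_day = ingress_per_day // block_size

print(f"ingress:       {ingress_bps / 1e9:.1f} GB/s")
print(f"with replicas: {total_write_bps / 1e9:.1f} GB/s")
print(f"min disks:     {min_disks:.0f}")
print(f"new blocks:    {blocks_per_day / 1e6:.1f} million/day")
```

Even under these generous assumptions you are writing tens of GB/s across the cluster and need hundreds of disks before you do any work with the data at all.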

At 1 PB/day, ignoring all network issues, you will reach the currently documented HDFS limits within four weeks. What are you going to do then, or will you have processed the data down by that point?
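A rough sketch of why that limit arrives so fast, assuming 128 MB blocks and a practical per-NN ceiling on the order of a few hundred million blocks (the exact figure depends on heap size and version; 300 million here is an assumed round number, not a documented constant):

```python
# Object-count growth at 1 PB/day. Both the block size and the NN ceiling
# are assumptions chosen for order-of-magnitude illustration.

MB = 10**6
PB = 10**15

block_size = 128 * MB
blocks_per_day = (1 * PB) // block_size   # unique new blocks per day
practical_limit = 300 * 10**6             # assumed per-NN block ceiling

days_to_limit = practical_limit / blocks_per_day
print(f"{blocks_per_day / 1e6:.1f} M new blocks/day "
      f"-> ceiling reached in ~{days_to_limit:.0f} days")
```

Smaller blocks or smaller files make it worse; bigger files and aggressive compaction of the raw data buy you time, which is why the file-size question above matters so much.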

I can imagine some experiments you could conduct against a namenode to see what its limits are, but there are a lot of datacentre bandwidth and computation details you have to worry about above and beyond datanode performance issues.

Like Michael says, 1 PB/day sounds like a homework project, especially if you haven't used Hadoop at smaller scale. If it is homework, once you've done the work (and submitted it), it'd be nice to see the final paper.

If it is something you plan to take live, well, there are lots of issues to address, of which the NN is just one, and one you can test in advance. Ramping up the cluster with different loads will teach you more about the bottlenecks. Otherwise: there are people who know how to run Hadoop at scale who, in exchange for money, will help you.

-steve
