What would cause the name node to have a GC issue?

- I am opening at most 5000 connections and writing continuously through those 5000 connections to 5000 files at a time.
- The volume of data that I would write through the 5000 connections cannot be controlled, as it depends on the upstream applications that publish the data.

Now, if the heap memory nears its full size (let's say M GB) and a major GC cycle kicks in, the NameNode could stop responding for some time. This "stop the world" time should be directly proportional to the heap size. This may cause data to be backlogged in the streaming application's memory.

As for our architecture, it has a cluster of JMS queues, and we have a multithreaded application that picks the messages from the queue and streams them to the NameNode of a 20-node cluster using the exposed FileSystem API.

BTW, in the real world, if you have a fast car, you can race and win against a slow train; it all depends on which reference frame you are in :)

Regards,
Jagaran

________________________________
From: Michel Segel <[email protected]>
To: "[email protected]" <[email protected]>
Cc: "[email protected]" <[email protected]>; jagaran das <[email protected]>
Sent: Wednesday, 10 August 2011 11:26 AM
Subject: Re: Namenode Scalability

So many questions, why stop there?

First question... What would cause the name node to have a GC issue?

Second question... You're streaming 1PB a day. Is this a single stream of data?
Are you writing this to one file before processing, or are you processing the data directly on the ingestion stream?
Are you also filtering the data so that you are not saving all of the data?

This sounds more like a homework assignment than a real-world problem.

I guess people don't race cars against trains or have two trains traveling in different directions anymore... :-)

Sent from a remote device. Please excuse any typos...

Mike Segel

On Aug 10, 2011, at 12:07 PM, jagaran das <[email protected]> wrote:

> To be precise, the projected data is around 1 PB.
> But the publishing rate is also around 1GBPS.
>
> Please suggest.
>
>
> ________________________________
> From: jagaran das <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Wednesday, 10 August 2011 12:58 AM
> Subject: Namenode Scalability
>
> In my current project we are planning to stream data to the Namenode (20
> Node Cluster).
> Data volume would be around 1 PB per day.
> But there are applications which can publish data at 1GBPS.
>
> A few queries:
>
> 1. Can a single Namenode handle such high-speed writes? Or does it become
> unresponsive when the GC cycle kicks in?
> 2. Can we have multiple federated Name nodes sharing the same slaves, so
> that we can distribute the writes accordingly?
> 3. Can multiple region servers of HBase help us?
>
> Please suggest how we can design the streaming part to handle such a scale of
> data.
>
> Regards,
> Jagaran Das
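
For reference, the two figures quoted in this thread (1 PB per day and 1 GBPS) can be compared with some quick arithmetic. The sketch below is only illustrative. It assumes decimal units (1 PB = 10^15 bytes), reads "GBPS" both as gigabytes and as gigabits per second because the thread does not say which is meant, and uses a hypothetical 30-second pause to estimate how much data would back up in the streaming application while the NameNode is not responding. The function names are made up for this example.

```python
# Back-of-the-envelope comparison of the rates mentioned in the thread.
# Assumptions (not from the thread): decimal units, a 30 s pause.

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

GB = 10**9   # bytes in a gigabyte (decimal)
TB = 10**12  # bytes in a terabyte (decimal)
PB = 10**15  # bytes in a petabyte (decimal)


def sustained_rate(volume_bytes_per_day):
    """Average bytes per second needed to move a given daily volume."""
    return volume_bytes_per_day / SECONDS_PER_DAY


def daily_volume(rate_bytes_per_second):
    """Bytes moved in one day at a constant rate."""
    return rate_bytes_per_second * SECONDS_PER_DAY


def backlog_bytes(rate_bytes_per_second, pause_seconds):
    """Bytes that accumulate upstream if writes stall for pause_seconds."""
    return rate_bytes_per_second * pause_seconds


# Rate needed to sustain 1 PB per day.
print(f"1 PB/day needs {sustained_rate(PB) / GB:.2f} GB/s")

# Daily volume if "1GBPS" means 1 gigabyte per second.
print(f"1 GB/s gives {daily_volume(GB) / TB:.1f} TB/day")

# Daily volume if "1GBPS" means 1 gigabit per second (125 MB/s).
print(f"1 Gbit/s gives {daily_volume(GB / 8) / TB:.1f} TB/day")

# Data backed up during a hypothetical 30 s pause at 1 GB/s.
print(f"30 s pause at 1 GB/s backs up {backlog_bytes(GB, 30) / GB:.0f} GB")
```

Under these assumptions, 1 PB per day works out to roughly 11.57 GB/s sustained, while a constant 1 GB/s yields 86.4 TB per day and 1 Gbit/s yields 10.8 TB per day, so it would help to know which of the two figures is the actual requirement.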
