Keeping all the data structures simple and in RAM lets us keep the
transaction rate pretty high.
Going to a DB while keeping the transaction rate up would require a
lot of engineering, and it would add complexity to administering the
system.
I'm not a fan of this approach, at least not without a lot of analysis.
We are considering other approaches, but a limit of tens of millions
of files per cluster seems acceptable in the medium term. It is a
reasonable compromise between simplicity, efficiency and scalability.
I'm sure we will continue to revisit this on the list regularly.
It is clearly a frustrating limit, but it is also one that can be
worked around.
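
For anyone doing capacity planning against that limit, here is a
back-of-envelope sketch (a sketch only: the ~1,000,000 files per GB of
namenode heap figure is the conservative estimate from the HADOOP-1687
discussion quoted further down, and the heap size and ingest rate are
just example numbers, not anyone's real deployment):

    // Rough namenode capacity estimate. The files-per-GB constant is
    // the conservative HADOOP-1687 figure; heap size and ingest rate
    // are assumed example values.
    public class NamenodeCapacityEstimate {
        public static void main(String[] args) {
            long heapGb = 4;              // namenode heap, in GB
            long filesPerGb = 1000000L;   // conservative files per GB of heap
            long filesPerDay = 5500L;     // new files created per day

            long fileCapacity = heapGb * filesPerGb;
            long daysUntilFull = fileCapacity / filesPerDay;

            System.out.println("Approximate file capacity: " + fileCapacity);
            System.out.println("Days until full at current rate: " + daysUntilFull);
        }
    }

With 4 GB of heap and 5,000~6,000 new files a day this comes out to
roughly 700~800 days, in line with the estimate further down the
thread.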
On Aug 28, 2007, at 7:59 PM, Taeho Kang wrote:
Thank you to all who have taken the time to answer my questions.
We've been using version 0.13. We do not have much of a problem right
now, but we will surely upgrade the system quite soon to 0.14 or 0.15
once it's released.
What are your opinions on implementing the namenode metadata
management using a DB? (maybe as a subproject?)
Do you think it will make the system more scalable, or is the
additional complexity of using a DB not worth considering?
/Taeho
On 8/29/07, Sameer Paranjpye <[EMAIL PROTECTED]> wrote:
>
>
>
> Taeho Kang wrote:
> > Hello Sameer. Thank you for your useful link. It's been very
> > helpful!
> >
> > By the way, our Hadoop cluster has a namenode with 4GBytes of RAM.
> >
> > Based on the analysis found in HADOOP-1687
> > (http://issues.apache.org/jira/browse/HADOOP-1687), we could
> > probably state that, to be conservative, every 1 GB of RAM gives
> > the namenode the capacity to manage 1,000,000 files
> > (10,600,000 files / 9 GB = 1,177,777 files / GB).
>
> That analysis is based on a Hadoop 0.13 deployment. Hadoop 0.14
> significantly improves on that by removing .crc files in the
> Namenode. If you have lots of small files in HDFS it will
> effectively cut memory usage in half. What version of Hadoop are
> you using?
>
> HADOOP-1687 also outlines an approach to further reduce memory
> usage in the Namenode that could show a further 40% improvement.
> This ought to be done in time for Hadoop 0.15. Beyond that the
> direction to take is unclear and there hasn't been a lot of
> discussion. As and when it happens it'll show up on the bug list,
> so stay tuned.
>
> >
> > If I were to apply this "rule" to my 4 GB RAM namenode, it should
> > be able to manage 4,000,000 files.
> > The number of files being stored in our Hadoop DFS is 5,000~6,000
> > files a day. That gives about 700~800 days, assuming the number of
> > files stored each day stays at the current level. Unfortunately,
> > it has been steadily going up as more people in our company have
> > joined the fun of using the Hadoop cluster.
> >
> > Is there a plan to redesign the Namenode in a way it doesn't have
> > this limit? (e.g. use a DB for metadata management)
> >
> > Thank you once again!
> >
> > /Taeho
> >
> >
> >
> >
> > On 8/28/07, Sameer Paranjpye <[EMAIL PROTECTED]> wrote:
> >
> > How much memory does your Namenode machine have?
> >
> > You should look at the number of files, directories and blocks on
> > your installation. All these numbers are available via
> > NamenodeFsck.Result.
> >
> > HADOOP-1687 (http://issues.apache.org/jira/browse/HADOOP-1687) has
> > a detailed discussion of the amount of memory used by Namenode
> > data structures.
> >
> > Sameer
> >
> > Taeho Kang wrote:
> > > Dear All,
> > >
> > > Hi, my name is Taeho and I am trying to figure out the maximum
> > > number of files a namenode can hold.
> > > The main reason for doing this is that I want to have some
> > > estimates on how many files I can put into the HDFS without
> > > overflowing the Namenode machine's memory.
> > >
> > > I know the number depends on the size of memory and how much is
> > > allocated for the running JVM.
> > > For the memory usage by the namenode, I can simply use the
> > > Runtime object of the JDK.
> > > For the total number of files residing in the DFS, I am thinking
> > > of using the getTotalFiles() function of the NamenodeFsck class
> > > in the org.apache.hadoop.dfs package. Am I correct here in using
> > > NamenodeFsck?
> > >
> > > Or, has anybody done similar experiments?
> > >
> > > Any comments/suggestions will be appreciated.
> > > Thanks in advance.
> > > Best Regards,
> > >
> >
> >
> >
> >
> > --
> > Taeho Kang [tkang.blogspot.com]
>
--
Taeho Kang [tkang.blogspot.com]
Software Engineer, NHN Corporation, Korea