Hi Taeho,

Fortunately for us, we don't have a need for storing millions of files in
HDFS just yet. We are adding only a few thousand files a day, so it will be
a while before we get anywhere near that. And we've been using Hadoop for
more than a year, and its reliability has been superb.

Sounds great.

This is just a rough estimate, but we see that about 1GB of namenode RAM
is required for every 1 million files. Newer versions of Hadoop have a more
optimized namenode, so it can host more files, but to be conservative we
treat 6-7 million files as the limit for an 8GB namenode machine.

Ah, that would explain why my first attempt failed: I am running a namenode with 1GB of RAM. It worked OK up to about 3 million files and then died completely. I am now using a nightly build of Hadoop/HBase; does that mean I am in better shape, and how much better does it perform?

I don't think adding the "consolidation" feature to Hadoop itself is a good
idea. As I said, you may have to add a "layer" that does the consolidation
work, and use that layer only when necessary.

Yes, of course, that is what I meant: we have to handle the creation of the slaps on our end. But that is where I think we would have to reinvent the wheel, so to speak.
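Just so we are talking about the same thing, here is the rough shape of the layer I have in mind (pure sketch, untested, and none of these names exist in Hadoop or anywhere else):

// Hypothetical consolidation layer, not an existing Hadoop API.
// store() appends a small file to the current slap and records where it went;
// retrieve() uses that record (kept in the DB) to read the file back out.
public interface SlapStore {

    /** Appends a small file to the current slap and returns its location. */
    SlapEntry store(String fileName, byte[] content) throws java.io.IOException;

    /** Reads a single file back out of its slap using the recorded location. */
    byte[] retrieve(SlapEntry entry) throws java.io.IOException;

    /** What we would keep in the DB for each stored file. */
    class SlapEntry {
        public final String slapPath;  // HDFS path of the slap the file lives in
        public final long offset;      // byte offset of the file within that slap
        public final long length;      // length of the file in bytes

        public SlapEntry(String slapPath, long offset, long length) {
            this.slapPath = slapPath;
            this.offset = offset;
            this.length = length;
        }
    }
}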

As far as the performance is concerned, I don't think it's much of an issue.
The only cost I can think of is the time taken to make a query to a DB plus
some time to find the desired file from a given "slap."

OK, my concern is more the size of each slap. Doing some quick math (correct me if I am wrong), 80TB of total storage divided by, say, a maximum of 1 million slaps means roughly 83MB per slap. That is quite a chunk to load, unless I can do a positioned read of just the part I need out of a slap. Does Hadoop support seeking to an offset when reading a file?
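Actually, answering part of my own question: FSDataInputStream does have seek() and positioned reads, so I am hoping something like this would pull in only the file I need instead of the whole slap (untested; the path, offset and length are made up and would really come from the DB lookup you mentioned):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SlapRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Made-up location of one small file inside a slap; in practice this
        // comes out of the DB that maps file name -> (slap, offset, length).
        Path slap = new Path("/slaps/slap-00042.dat");
        long offset = 12345678L;
        int length = 64 * 1024;

        byte[] buf = new byte[length];
        FSDataInputStream in = fs.open(slap);
        try {
            // Positioned read: fetches only the requested byte range of the
            // slap, not all 80+ MB of it.
            in.readFully(offset, buf, 0, length);
        } finally {
            in.close();
        }
        // buf now holds the original small file's content.
    }
}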

Also, you may want to create slaps in such a way that no single file spans
more than one slap.

Yes, that makes sense. I could, for example, simply append files together, like an mbox, or use a ZIP archive. First I would cache enough files in a scratch directory in Hadoop and then archive them as one slap. (Again, that sounds similar to what HBase is doing?)
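For the append-files-together variant I am picturing something along these lines (rough, untested sketch; SlapWriter is a made-up name, and the offset map would of course end up in the DB rather than stay in memory):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SlapWriter {

    /**
     * Concatenates every file under scratchDir into a single slap file and
     * returns a map of original file name -> byte offset within the slap.
     * Lengths can be taken from the files' FileStatus entries.
     */
    public static Map<String, Long> writeSlap(FileSystem fs, Path scratchDir,
            Path slap) throws IOException {
        Map<String, Long> offsets = new HashMap<String, Long>();
        FSDataOutputStream out = fs.create(slap);
        try {
            for (FileStatus status : fs.listStatus(scratchDir)) {
                if (status.isDir()) {
                    continue;  // only plain files go into the slap
                }
                // Remember where this file starts inside the slap.
                offsets.put(status.getPath().getName(), out.getPos());
                FSDataInputStream in = fs.open(status.getPath());
                try {
                    IOUtils.copyBytes(in, out, 64 * 1024, false);
                } finally {
                    in.close();
                }
            }
        } finally {
            out.close();
        }
        return offsets;
    }
}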

Updates... woo, here we go again. Hadoop is not designed to handle this
need. Basically, its HDFS is designed for large files that rarely change...

Yes, understood. I could think of replacing whole slaps, or deleting slaps once all the files they contain are obsolete.
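For replacing a whole slap I would probably write the new version to a temporary path first and then swap it in, roughly like this (sketch only; the two steps together are not atomic):

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SlapReplace {

    /** Swaps a freshly written slap (tmpSlap) in over the live one. */
    public static void replaceSlap(FileSystem fs, Path tmpSlap, Path liveSlap)
            throws IOException {
        // Remove the old slap first; FileSystem.rename() will not overwrite
        // an existing destination file on its own.
        if (fs.exists(liveSlap)) {
            fs.delete(liveSlap, false);  // false = non-recursive, it is a single file
        }
        if (!fs.rename(tmpSlap, liveSlap)) {
            throw new IOException("rename failed: " + tmpSlap + " -> " + liveSlap);
        }
    }
}

Deleting a slap once everything in it is obsolete would then just be fs.delete() plus clearing the matching rows in the DB.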

Let us know how your situation goes.

Will do.

Lars
