Thanks for sharing your "painful" experience with us, Lars. I always wondered what would happen if HDFS tried to host a few hundred million files.
By the way, I think that with the current namenode design of Hadoop, it is unlikely that you will ever be able to host 500 million files on a single cluster unless you have a few hundred GB of RAM in a single machine. For your problem, you might want to decrease the number of files dramatically - down to N million files, where 0 < N < 10~20, to be safe. The N really depends on how much RAM you have in your namenode. For more info on how one may calculate the max number of files in HDFS, you may want to take a look at http://issues.apache.org/jira/browse/HADOOP-1687. You might also want to have some "layer" that combines lots of small files into one big file, and that one big file is the one that gets into HDFS. (Two rough sketches of these ideas follow at the bottom of this message, below the quoted thread.)

Let us know how your project goes... :-)

On 1/6/08, Lars George <[EMAIL PROTECTED]> wrote:
>
> Ted,
>
> In an absolute worst case scenario. I know this is beta and all, but I
> am starting to use HBase in a production environment and need to limit
> downtime (which is what this architecture promises) to a minimum - none
> at all if I can.
>
> All in all, if I cannot rely on HBase being stable yet, what would you
> (or others) recommend to store, say, 500m documents on a set of 40
> servers with about 1.7TB each? I need this to be reliable, fail-over
> safe etc. Basically the Hadoop approach, but without it. And storage is
> only one question; how do I parallelize the computing to make use of 80
> CPUs?
>
> Initially I used Hadoop directly, and what happened is that after 4m
> documents the HDFS simply died on me; the name server would not start
> anymore and I had to format the whole thing again and insert the files
> once more, this time into HBase.
>
> I assume it was my rather simplistic directory structure, but I was not
> able to find any FAQ or hint of any kind on how to lay out the files
> best in Hadoop, so I had to go with my first assumption. If I knew how
> to use HDFS reliably, then I would not worry too much.
>
> Is there anyone out there who could help me get this sorted out? I am
> willing to pay consulting fees if I have to. At the moment I am at a
> loss - sure, a trial-and-error approach would keep me going forward, but
> I am on a tight deadline too, and that counters that approach.
>
> Any help is appreciated.
>
> Thanks,
> Lars
>
>
> Ted Dunning wrote:
> > Lars,
> >
> > Can you dump your documents to external storage (either HDFS or ordinary
> > file space storage)?
> >
> >
> > On 1/4/08 10:01 PM, "larsgeorge" <[EMAIL PROTECTED]> wrote:
> >
> >
> >> Jim,
> >>
> >> I have inserted about 5 million documents into HBase and translate them
> >> into 15 languages (meaning I end up with about 75 million in the end).
> >> That data is only recreatable if we run the costly processing again.
> >> So I am in need of a migration path.
> >>
> >> For me this is definitely a +1 for a migration tool.
> >>
> >> Sorry to be a hassle like this. :\
> >>
> >> Lars
> >>
> >> ----
> >> Lars George, CTO
> >> WorldLingo
> >>
> >>
> >> Jim Kellerman wrote:
> >>
> >>> Do you have data stored in HBase that you cannot recreate?
> >>>
> >>> HADOOP-2478 will introduce an incompatible change in how HBase
> >>> lays out files in HDFS so that, should the root or meta tables
> >>> be corrupted, it will be possible to reconstruct them from
> >>> information in the file system alone.
> >>>
> >>> The problem is in building a migration utility. Anything that
> >>> we could build to migrate from the current file structure to
> >>> the new file structure would require that the root and meta
> >>> regions be absolutely correct. If they are not, the migration
> >>> would fail, because there is not enough information on disk
> >>> currently to rebuild the root and meta regions.
> >>>
> >>> Is it acceptable for this change to be made without the provision
> >>> of an upgrade utility?
> >>>
> >>> If not, are you willing to accept the risk that the upgrade
> >>> may fail if you have corruption in your root or meta regions?
> >>>
> >>> After HADOOP-2478, we will be able to build a fault-tolerant
> >>> upgrade utility, should HBase's file structure change again.
> >>> Additionally, we will be able to provide the equivalent of
> >>> fsck for HBase after HADOOP-2478.
> >>>
> >>> ---
> >>> Jim Kellerman, Senior Engineer; Powerset
> >>>
> >>
> >

--
Taeho Kang [tkang.blogspot.com]
Software Engineer, NHN Corporation, Korea
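
P.S. Here is a back-of-the-envelope sketch of where the "few hundred GB of RAM" figure comes from. The ~150 bytes per namespace object (file, directory, block) is an assumed rule-of-thumb average, not an exact number - HADOOP-1687 has the detailed accounting.

// Back-of-the-envelope namenode heap estimate. All numbers here are
// assumptions for illustration: every file, directory and block lives in
// namenode memory at very roughly ~150 bytes per object on average
// (it varies with path length, replication factor, etc.).
public class NamenodeHeapEstimate {
    public static void main(String[] args) {
        long files          = 500L * 1000 * 1000; // the 500 million files from this thread
        long blocksPerFile  = 1;                  // small files: assume one block each
        long bytesPerObject = 150;                // assumed rule-of-thumb average

        long objects   = files + files * blocksPerFile;  // files + blocks (directories ignored)
        long heapBytes = objects * bytesPerObject;

        System.out.println("Estimated namenode heap: "
                + heapBytes / (1024L * 1024 * 1024) + " GB");
        // Prints about 139 GB with these assumptions,
        // i.e. "a few hundred GB of RAM" territory.
    }
}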
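
P.P.S. And a minimal sketch of the small-file-combining "layer", using Hadoop's SequenceFile with the original file name as the key and the raw document bytes as the value. The class name, paths and command-line usage are made up for illustration; a MapFile or your own container format would do just as well.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs many small local files into a single SequenceFile on HDFS so the
// namenode tracks one entry instead of millions.
// Hypothetical usage: java SmallFilePacker /user/lars/docs-0001.seq doc1.xml doc2.xml ...
public class SmallFilePacker {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path container = new Path(args[0]);

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, container, Text.class, BytesWritable.class);
        try {
            for (int i = 1; i < args.length; i++) {
                File local = new File(args[i]);
                byte[] buf = new byte[(int) local.length()];
                FileInputStream in = new FileInputStream(local);
                try {
                    int off = 0;
                    while (off < buf.length) {
                        int n = in.read(buf, off, buf.length - off);
                        if (n < 0) break;
                        off += n;
                    }
                } finally {
                    in.close();
                }
                // key = original file name, value = the document's raw bytes
                writer.append(new Text(local.getName()), new BytesWritable(buf));
            }
        } finally {
            writer.close();
        }
    }
}

A reader (a map task, for example) would then iterate over the container with SequenceFile.Reader instead of opening millions of tiny HDFS files, which is what keeps the namenode's object count down.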
