He didn't store all of these documents in separate files. He stored them in HBase, hence his pain with the upgrade.
On 1/6/08 5:44 PM, "Taeho Kang" <[EMAIL PROTECTED]> wrote:

> Thanks for sharing your "painful" experience with us, Lars. I always
> wondered what would happen if HDFS tried to host a few hundred million
> files.
>
> By the way, I think that with the current namenode design of Hadoop it is
> unlikely you will ever be able to host 500 million files on a single
> cluster unless you have a few hundred GB of RAM in a single machine.
>
> For your problem, you might want to decrease the number of files
> dramatically - to N million files, where 0 < N < 10~20, to be safe. The N
> really depends on how much RAM you have in your namenode.
>
> For more info on how one may calculate the max number of files in HDFS,
> you may want to take a look at
> http://issues.apache.org/jira/browse/HADOOP-1687.
>
> You might also want to have some "layer" that combines lots of small files
> into one big file, and that one big file is the one that goes into HDFS.
>
> Let us know how your project goes... :-)
>
> On 1/6/08, Lars George <[EMAIL PROTECTED]> wrote:
>>
>> Ted,
>>
>> In an absolute worst case scenario. I know this is beta and all, but I
>> am starting to use HBase in a production environment and need to limit
>> downtime (which is what this architecture promises) to a minimum - none
>> at all if I can.
>>
>> All in all, if I cannot yet rely on HBase being stable, what would you
>> (or others) recommend for storing, say, 500m documents on a set of 40
>> servers with about 1.7TB each? I need this to be reliable, fail-over
>> safe, etc. Basically the Hadoop approach, but without it. And storage is
>> only one question: how do I parallelize the computing to make use of 80
>> CPUs?
>>
>> Initially I used Hadoop directly, and what happened is that after 4m
>> documents HDFS simply died on me; the name server would not start
>> anymore, and I had to format the whole thing again and insert the files
>> once more, this time into HBase.
>>
>> I assume it was my rather simplistic directory structure, but I was not
>> able to find any FAQ or hint of any kind on how best to lay out the
>> files in Hadoop, so I had to go with my first assumption. If I knew how
>> to use HDFS reliably, I would not worry too much.
>>
>> Is there anyone out there who could help me get this sorted out? I am
>> willing to pay consulting fees if I have to. At the moment I am at a
>> loss - sure, a trial-and-error approach would keep me moving forward, but
>> I am on a tight deadline too, and that counters that approach.
>>
>> Any help is appreciated.
>>
>> Thanks,
>> Lars
>>
>>
>> Ted Dunning wrote:
>>> Lars,
>>>
>>> Can you dump your documents to external storage (either HDFS or ordinary
>>> file space storage)?
>>>
>>>
>>> On 1/4/08 10:01 PM, "larsgeorge" <[EMAIL PROTECTED]> wrote:
>>>
>>>
>>>> Jim,
>>>>
>>>> I have inserted about 5 million documents into HBase and translate them
>>>> into 15 languages (meaning I end up with about 75 million in the end).
>>>> That data is only recreatable if we run the costly processing again. So
>>>> I am in need of a migration path.
>>>>
>>>> For me this is definitely a +1 for a migration tool.
>>>>
>>>> Sorry to be a hassle like this. :\
>>>>
>>>> Lars
>>>>
>>>> ----
>>>> Lars George, CTO
>>>> WorldLingo
>>>>
>>>>
>>>> Jim Kellerman wrote:
>>>>
>>>>> Do you have data stored in HBase that you cannot recreate?
>>>>>
>>>>> HADOOP-2478 will introduce an incompatible change in how HBase
>>>>> lays out files in HDFS, so that should the root or meta tables
>>>>> be corrupted, it will be possible to reconstruct them from
>>>>> information in the file system alone.
>>>>>
>>>>> The problem is in building a migration utility. Anything that
>>>>> we could build to migrate from the current file structure to
>>>>> the new file structure would require that the root and meta
>>>>> regions be absolutely correct. If they are not, the migration
>>>>> would fail, because there is currently not enough information
>>>>> on disk to rebuild the root and meta regions.
>>>>>
>>>>> Is it acceptable for this change to be made without the provision
>>>>> of an upgrade utility?
>>>>>
>>>>> If not, are you willing to accept the risk that the upgrade
>>>>> may fail if you have corruption in your root or meta regions?
>>>>>
>>>>> After HADOOP-2478, we will be able to build a fault-tolerant
>>>>> upgrade utility, should HBase's file structure change again.
>>>>> Additionally, we will be able to provide the equivalent of
>>>>> fsck for HBase after HADOOP-2478.
>>>>>
>>>>> ---
>>>>> Jim Kellerman, Senior Engineer; Powerset
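
A rough back-of-envelope check on the namenode RAM figure mentioned above, assuming the commonly cited ballpark of ~150 bytes of namenode heap per file object and per block object (HADOOP-1687 has the real accounting) and one block per small document. This is an estimate only, not a measured number:

    // Back-of-envelope only: the ~150 bytes/object figure and the
    // one-block-per-document assumption are rules of thumb, not measurements.
    public class NamenodeHeapEstimate {
        public static void main(String[] args) {
            long files = 500000000L;      // 500 million documents
            long blocksPerFile = 1;       // small docs fit in one block
            long bytesPerObject = 150;    // assumed per-object heap overhead
            long objects = files + files * blocksPerFile;
            long heapBytes = objects * bytesPerObject;
            System.out.printf("~%.0f GB of namenode heap%n",
                    heapBytes / (1024.0 * 1024 * 1024));
        }
    }

At 500 million single-block files that works out to roughly 140 GB of heap, which is consistent with the "few hundred GB of RAM in a single machine" remark above.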
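
As for the "layer" that combines lots of small files into one big file, below is a minimal sketch of that idea using Hadoop's SequenceFile, with the document name as the key and the raw document bytes as the value. The paths, arguments, and class names are illustrative assumptions, not anyone's actual pipeline:

    import java.io.File;
    import java.io.FileInputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Packs a local directory of small documents into one SequenceFile in
    // HDFS, so the namenode tracks a single large file instead of millions
    // of small ones.
    public class PackDocuments {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path out = new Path(args[1]);   // e.g. /docs/part-00000.seq
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, out, Text.class, BytesWritable.class);
            try {
                for (File doc : new File(args[0]).listFiles()) {
                    byte[] body = readFully(doc);
                    // Key = original document name, value = raw bytes.
                    writer.append(new Text(doc.getName()),
                                  new BytesWritable(body));
                }
            } finally {
                writer.close();
            }
        }

        private static byte[] readFully(File f) throws Exception {
            byte[] buf = new byte[(int) f.length()];
            FileInputStream in = new FileInputStream(f);
            try {
                int off = 0;
                while (off < buf.length) {
                    int n = in.read(buf, off, buf.length - off);
                    if (n < 0) break;
                    off += n;
                }
            } finally {
                in.close();
            }
            return buf;
        }
    }

A MapReduce job can then read the container file back record by record (SequenceFileInputFormat splits it across tasks), which is also one way to spread the processing over the 80 CPUs mentioned above.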