He didn't store all of these documents in separate files. He stored them in
HBase, hence his pain with the upgrade.


On 1/6/08 5:44 PM, "Taeho Kang" <[EMAIL PROTECTED]> wrote:

> Thanks for sharing your "painful" experience with us, Lars. I always
> wondered what would happen if HDFS tried to host a few hundred million files.
> 
> By the way, I think, with the current namenode design of Hadoop, it is
> unlikely that you will ever be able to host 500 million files on a single
> cluster unless you have a few hundred GB of RAM in a single machine.
> 
> For your problem, you might want to decrease the number of files dramatically
> - down to N million files, where N is at most around 10~20, to be safe. The
> right N really depends on how much RAM you have in your namenode.
> 
> For more info on how one may calculate the max number of files in
> HDFS, you may want to take a look at
> http://issues.apache.org/jira/browse/HADOOP-1687.
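> 
> As a very rough illustration only (the ~150 bytes per namespace object used
> below is an assumption, not a measured figure - HADOOP-1687 has the real
> accounting), a back-of-envelope heap estimate could look like this:
> 
>     // Rough namenode heap estimate. Assumption: ~150 bytes of heap per
>     // file object and per block object; see HADOOP-1687 for exact numbers.
>     public class NamenodeHeapEstimate {
>         public static void main(String[] args) {
>             long files = 500000000L;     // target number of files
>             long blocksPerFile = 1;      // small files -> one block each
>             long bytesPerObject = 150;   // assumed per-object overhead
> 
>             long objects = files + files * blocksPerFile;
>             double heapGb = objects * bytesPerObject / 1e9;
>             System.out.printf("~%.0f GB of namenode heap for %d files%n",
>                     heapGb, files);
>         }
>     }
> 
> For 500 million single-block files that comes out to roughly 150 GB of heap,
> which is why the file count has to come down so drastically.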
> 
> You might also want to have some "layer" that combines lots of small files
> into one big file, so that the one big file is what actually gets stored in
> HDFS.
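> 
> For what it's worth, a minimal sketch of such a layer - assuming a local
> staging directory and a SequenceFile keyed by file name; the paths, key/value
> types and error handling here are just placeholders - might be:
> 
>     import java.io.File;
>     import java.nio.file.Files;
> 
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
>     import org.apache.hadoop.io.BytesWritable;
>     import org.apache.hadoop.io.SequenceFile;
>     import org.apache.hadoop.io.Text;
> 
>     // Packs every document in a local directory into one SequenceFile on
>     // HDFS, keyed by the original file name. Sketch only.
>     public class SmallFilePacker {
>         public static void main(String[] args) throws Exception {
>             Configuration conf = new Configuration();
>             FileSystem fs = FileSystem.get(conf);
>             Path out = new Path(args[1]);   // e.g. a .seq file in HDFS
> 
>             SequenceFile.Writer writer = SequenceFile.createWriter(
>                     fs, conf, out, Text.class, BytesWritable.class);
>             try {
>                 for (File doc : new File(args[0]).listFiles()) {
>                     byte[] body = Files.readAllBytes(doc.toPath());
>                     writer.append(new Text(doc.getName()),
>                             new BytesWritable(body));
>                 }
>             } finally {
>                 writer.close();
>             }
>         }
>     }
> 
> The namenode then only has to track the one SequenceFile (and its blocks),
> and a MapReduce job can still split and process it in parallel.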
> 
> Let us know how your project goes... :-)
> 
> On 1/6/08, Lars George <[EMAIL PROTECTED]> wrote:
>> 
>> Ted,
>> 
>> In an absolute worst case scenario. I know this is beta and all, but I am
>> starting to use HBase in a production environment and need to limit downtime
>> (which is what this architecture promises) to a minimum - none at all if I
>> can.
>> 
>> All in all, if I cannot rely on HBase being stable yet, what would you (or
>> others) recommend for storing, say, 500m documents on a set of 40 servers
>> with about 1.7TB each? I need this to be reliable, fail-over safe, etc. -
>> basically the Hadoop approach, but without Hadoop. And storage is only one
>> question; how do I parallelize the computation to make use of 80 CPUs?
>> 
>> Initially I used Hadoop directly, and what happened is that after 4m
>> documents HDFS simply died on me: the namenode would not start anymore and
>> I had to format the whole thing again and insert the files once more, this
>> time into HBase.
>> 
>> I assume it was my rather simplistic directory structure, but I was not
>> able to find any FAQ or hint of any kind on how to best lay out the files
>> in Hadoop, so I had to go with my first guess. If I knew how to use HDFS
>> reliably, then I would not worry too much.
>> 
>> Is there anyone out there who could help me get this sorted out? I am
>> willing to pay consulting fees if I have to. At the moment I am at a loss -
>> sure, a trial-and-error approach would keep me moving forward, but I am on
>> a tight deadline too, and that rules that approach out.
>> 
>> Any help is appreciated.
>> 
>> Thanks,
>> Lars
>> 
>> 
>> Ted Dunning wrote:
>>> Lars,
>>> 
>>> Can you dump your documents to external storage (either HDFS or ordinary
>>> file system storage)?
>>> 
>>> 
>>> On 1/4/08 10:01 PM, "larsgeorge" <[EMAIL PROTECTED]> wrote:
>>> 
>>> 
>>>> Jim,
>>>> 
>>>> I have inserted about 5 million documents into HBase and translated them
>>>> into 15 languages (meaning I will end up with about 75 million in the
>>>> end). That data is only recreatable by running the costly processing
>>>> again, so I am in need of a migration path.
>>>> 
>>>> For me this is a definitely +1 for a migration tool.
>>>> 
>>>> Sorry to be a hassle like this. :\
>>>> 
>>>> Lars
>>>> 
>>>> ----
>>>> Lars George, CTO
>>>> WorldLingo
>>>> 
>>>> 
>>>> Jim Kellerman wrote:
>>>> 
>>>>> Do you have data stored in HBase that you cannot recreate?
>>>>> 
>>>>> HADOOP-2478 will introduce an incompatible change in how HBase
>>>>> lays out files in HDFS so that should the root or meta tables
>>>>> be corrupted, it will be possible to reconstruct them from
>>>>> information in the file system alone.
>>>>> 
>>>>> The problem is in building a migration utility. Anything that
>>>>> we could build to migrate from the current file structure to
>>>>> the new file structure would require that the root and meta
>>>>> regions be absolutely correct. If they are not, the migration
>>>>> would fail, because there is not enough information on disk
>>>>> currently to rebuild the root and meta regions.
>>>>> 
>>>>> Is it acceptable for this change to be made without the provision
>>>>> of an upgrade utility?
>>>>> 
>>>>> If not, are you willing to accept the risk that the upgrade
>>>>> may fail if you have corruption in your root or meta regions?
>>>>> 
>>>>> After HADOOP-2478, we will be able to build a fault tolerant
>>>>> upgrade utility, should HBase's file structure change again.
>>>>> Additionally, we will be able to provide the equivalent of
>>>>> fsck for HBase after HADOOP-2478.
>>>>> 
>>>>> ---
>>>>> Jim Kellerman, Senior Engineer; Powerset
>>>>> 
>>>>> 
>>> 
>>> 
>> 
> 
> 
