Ted,
This is actually both: I first tried Hadoop directly and then HBase in
my second attempt.
Lars
Ted Dunning wrote:
He didn't store all of these documents in separate files. He stored them in
hbase, hence his pain with the upgrade.
On 1/6/08 5:44 PM, "Taeho Kang" <[EMAIL PROTECTED]> wrote:
Thanks for sharing your "painful" experience with us, Lars. I always
wondered what would happen if HDFS tried to host a few hundred million files.
By the way, I think with the current namenode design of Hadoop it is
unlikely that you will ever be able to host 500 million files on a single
cluster unless you have a few hundred GB of RAM in a single machine.
For your problem, you might want to decrease the number of files dramatically
- down to N million files, where N is somewhere below 10-20 to be safe. The
exact N really depends on how much RAM you have in your namenode.
For more info on how one may calculate the max number of files in
HDFS, you may want to take a look at
http://issues.apache.org/jira/browse/HADOOP-1687.
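As a very rough back-of-envelope (the figure below is only a ballpark I am
assuming - roughly 150 bytes of namenode heap per file/block/directory
object; see the issue above for the real accounting):

    500,000,000 files x (1 file object + 1 block object) x ~150 bytes
      = ~150 GB of namenode heap, before any directory overhead

which is roughly where the "few hundred GB of RAM" figure comes from.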
You might also want to have some "layer" that combines lots of small files
into one big file, and that one big file is the one that gets into HDFS.
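One common way to build such a layer (just a sketch with made-up class and
path names, using the stock SequenceFile API) is to pack each batch of
documents into a single SequenceFile, keyed by the original file name:

    import java.io.DataInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Packs many small local documents into one SequenceFile in HDFS.
    // Usage: PackFiles <hdfs output path> <local doc> [<local doc> ...]
    public class PackFiles {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[0]);
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, Text.class, BytesWritable.class);
        try {
          for (int i = 1; i < args.length; i++) {
            File doc = new File(args[i]);
            byte[] buf = new byte[(int) doc.length()];
            DataInputStream in = new DataInputStream(new FileInputStream(doc));
            in.readFully(buf);
            in.close();
            // key = original file name, value = raw document bytes
            writer.append(new Text(doc.getName()), new BytesWritable(buf));
          }
        } finally {
          writer.close();
        }
      }
    }

The namenode then only has to track one file per batch, and a map/reduce job
can still read the documents back record by record.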
Let us know how your project goes... :-)
On 1/6/08, Lars George <[EMAIL PROTECTED]> wrote:
Ted,
In an absolute worst-case scenario. I know this is beta and all, but I am
starting to use HBase in a production environment and need to limit downtime
(which is what this architecture promises) to a minimum - none at all if I
can.
All in all, if I cannot yet rely on HBase being stable, what would you
(or others) recommend for storing, say, 500m documents on a set of 40
servers with about 1.7TB each? I need this to be reliable, fail-over
safe, etc. - basically the Hadoop approach, but without it. And storage is
only one question; how do I parallelize the computing to make use of the 80
CPUs?
Initially I used Hadoop directly, and what happened is that after 4m
documents HDFS simply died on me: the namenode would not start
anymore and I had to format the whole thing again and insert the files
once more, this time into HBase.
I assume it was my rather simplistic directory structure, but I was not
able to find any FAQ or hint of any kind on how best to lay out the files
in Hadoop, so I had to go with my first assumption. If I knew how to use
HDFS reliably, I would not worry too much.
Is there anyone out there who could help me get this sorted out? I am
willing to pay consulting fees if I have to. At the moment I am at a
loss - sure, a trial-and-error approach would keep me moving forward, but
I am on a tight deadline too and that rules it out.
Any help is appreciated.
Thanks,
Lars
Ted Dunning wrote:
Lars,
Can you dump your documents to external storage (either HDFS or ordinary
file space storage)?
On 1/4/08 10:01 PM, "larsgeorge" <[EMAIL PROTECTED]> wrote:
Jim,
I have inserted about 5 million documents into HBase and translate them into
15 languages (which means I end up with about 75 million in the end). That
data is only recreatable by processing it again at considerable cost, so I am
in need of a migration path.
For me this is definitely a +1 for a migration tool.
Sorry to be a hassle like this. :\
Lars
----
Lars George, CTO
WorldLingo
Jim Kellerman wrote:
Do you have data stored in HBase that you cannot recreate?
HADOOP-2478 will introduce an incompatible change in how HBase
lays out files in HDFS so that should the root or meta tables
be corrupted, it will be possible to reconstruct them from
information in the file system alone.
The problem is in building a migration utility. Anything that
we could build to migrate from the current file structure to
the new file structure would require that the root and meta
regions be absolutely correct. If they are not, the migration
would fail, because there is not enough information on disk
currently to rebuild the root and meta regions.
Is it acceptable for this change to be made without the provision
of an upgrade utility?
If not, are you willing to accept the risk that the upgrade
may fail if you have corruption in your root or meta regions?
After HADOOP-2478, we will be able to build a fault tolerant
upgrade utility, should HBase's file structure change again.
Additionally, we will be able to provide the equivalent of
fsck for HBase after HADOOP-2478.
---
Jim Kellerman, Senior Engineer; Powerset