Hi!
I tried to run the wordcount example with a 1.9 GB input file on 6 nodes.
Hadoop split the job into 31 maps and one reduce.
All the maps fail with this stack trace:
java.lang.NullPointerException
at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:2394)
I am part of a working group that is developing a Bigtable-like structured
storage system for Hadoop HDFS (see
http://wiki.apache.org/lucene-hadoop/Hbase).
I am interested in learning about large HDFS installations:
- How many nodes do you have in a cluster?
- How much data do you store in HDFS?
We have a cluster of about 40 nodes, with about 14 TB of aggregate raw
storage. At peak times, I have had up to 3 or 4 terabytes of data stored in
HDFS, in probably 100-200k files.
To make things work for my tasks, I had to hash through a few different
tricks for dealing with large sets
Could you post some comparison data, since you have already done this?
--Konstantin
howard chen wrote:
such as Coda or GFS (RHEL);
I think their performance or features will be more mature?
I have run the random example to generate 10 GB of data; it seems that
HDFS is currently the bottleneck?
regards,
You need to set the input format of the second job. It defaults to
TextInputFormat, which is why you are seeing it become text. Use lines
like the ones below in the second job.
secondjob.setInputFormat(SequenceFileInputFormat.class);
secondjob.setInputKeyClass(Text.class);
For that to work, the output format of the previous job will have to be set to
SequenceFileOutputFormat.
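For context, here is a minimal sketch of how the two job configurations might pair
up, using the old JobConf API; the class name and the Text/IntWritable key/value
types are just illustrative assumptions (use whatever your first job actually emits):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class TwoJobSetup {
  public static void main(String[] args) {
    // First job: write its output as a SequenceFile instead of plain text.
    JobConf firstjob = new JobConf();
    firstjob.setOutputFormat(SequenceFileOutputFormat.class);
    firstjob.setOutputKeyClass(Text.class);          // assuming Text keys and
    firstjob.setOutputValueClass(IntWritable.class); // IntWritable counts, as in wordcount

    // Second job: read that SequenceFile back, so keys stay Text rather than
    // becoming the byte-offset/line pairs produced by TextInputFormat.
    JobConf secondjob = new JobConf();
    secondjob.setInputFormat(SequenceFileInputFormat.class);
    secondjob.setInputKeyClass(Text.class);
  }
}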
Note that unless the keys in the output of the first job contain no tab
characters, there's no way to read the existing output back in accurately.
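To see why (a hand-worked illustration, not from the original message):
TextOutputFormat writes each record as the key, a tab, then the value, so a
reader that splits each line on the first tab cannot tell a separator tab from
a tab inside the key:

public class TabInKey {
  public static void main(String[] args) {
    String key = "foo\tbar";                    // a key that happens to contain a tab
    String value = "1";
    String line = key + "\t" + value;           // TextOutputFormat writes "foo\tbar\t1"

    int sep = line.indexOf('\t');               // reading back: split on the first tab
    System.out.println(line.substring(0, sep)); // "foo"     -- not the original key
    System.out.println(line.substring(sep + 1));// "bar\t1"  -- not the original value
  }
}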
On 2/2/07, Dennis Kubes [EMAIL PROTECTED] wrote:
On Feb 2, 2007, at 2:46 PM, Bryan A. P. Pendleton wrote:
Note that, unless there are no tab characters in the keys of the output from
the first job, there's no way to read the existing output accurately back in.
*Sigh* That asymmetry in Text{In,Out}putFormat has bothered me for a while.
Yes, it would be nice to fix that at some point.
Possibly a shadow file that keeps track of the offset of each key/value pair in
the file (probably using VInt-encoded differences from the last value). The
existing output would be preserved, but someone reading the file could use
such a cheat sheet to locate the exact record boundaries.
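A rough sketch of what writing such a shadow file could look like, using
Hadoop's WritableUtils for the variable-length encoding; the class name, the
"part-00000.index" file name, and the offsets are purely hypothetical:

import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.WritableUtils;

public class OffsetIndexWriter {
  public static void main(String[] args) throws IOException {
    // Hypothetical byte offsets where each key/value pair starts in the text output.
    long[] recordOffsets = {0L, 42L, 97L, 150L};
    DataOutputStream index =
        new DataOutputStream(new FileOutputStream("part-00000.index"));
    long previous = 0;
    for (long offset : recordOffsets) {
      // Store each offset as the variable-length-encoded difference from the previous one.
      WritableUtils.writeVLong(index, offset - previous);
      previous = offset;
    }
    index.close();
  }
}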