java.lang.NullPointerException in Sorter/MergeQueue

2007-02-02 Thread Ion Badita
Hi! I tried to run the wordcount example with a 1.9 GB input file on 6 nodes. Hadoop split the job into 31 maps and one reduce. All the maps fail with this stack trace: java.lang.NullPointerException at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:2394)

Large data sets

2007-02-02 Thread Jim Kellerman
I am part of a working group that is developing a Bigtable-like structured storage system for Hadoop HDFS (see http://wiki.apache.org/lucene-hadoop/Hbase). I am interested in learning about large HDFS installations: - How many nodes do you have in a cluster? - How much data do you store in

Re: Large data sets

2007-02-02 Thread Bryan A. P. Pendleton
We have a cluster of about 40 nodes, with about 14 TB of aggregate raw storage. At peak times, I have had up to 3 or 4 terabytes of data stored in HDFS, stored in probably 100-200k files. To make things work for my tasks, I had to hash through a few different tricks for dealing with large sets

Re: Why don't use existing DFS?

2007-02-02 Thread Konstantin Shvachko
Could you post some comparison data, since you have already done it? --Konstantin howard chen wrote: such as Coda or GFS (RHEL), I think their performance or features will be more mature? I have run the random example to generate 10 GB of data; it seems HDFS is currently the bottleneck? regards,

Re: SequenceFile (Text,Text) becomes plain text

2007-02-02 Thread Dennis Kubes
You need to set the input format of the second job. It defaults to TextInputFormat, which is why you are seeing it become text. Use lines like the following in the second job: secondjob.setInputFormat(SequenceFileInputFormat.class); secondjob.setInputKeyClass(Text.class);

Re: SequenceFile (Text,Text) becomes plain text

2007-02-02 Thread Bryan A. P. Pendleton
For that to work, the output of the previous job will have to be set to SequenceFileOutputFormat. Note that, unless there are no tab characters in the keys of the output from the first job, there's no way to read the existing output accurately back in. On 2/2/07, Dennis Kubes [EMAIL PROTECTED]
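
The tab-character caveat in this message can be illustrated with a small sketch. It does not use any Hadoop classes; it only assumes that a text output writes each record as key, TAB, value on one line, and that a reader splits each line at the first TAB. The class and method names (TabAmbiguityDemo, writeLine, readLine) are hypothetical and exist only for this illustration.

```java
public class TabAmbiguityDemo {
    // Write one record the way a tab-separated text output would: key, TAB, value.
    // (Assumed format for illustration, not a Hadoop API.)
    static String writeLine(String key, String value) {
        return key + "\t" + value;
    }

    // Read a record back by splitting at the first TAB (an assumed reader convention).
    static String[] readLine(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        // A key without tabs survives the round trip.
        String[] ok = readLine(writeLine("apple", "3"));
        System.out.println(ok[0].equals("apple") && ok[1].equals("3")); // prints "true"

        // A key containing a tab is cut short: part of the key ends up in the value.
        String[] broken = readLine(writeLine("red\tapple", "3"));
        System.out.println(broken[0].equals("red\tapple")); // prints "false"
    }
}
```

In the second case the reader recovers the key "red" and the value "apple<TAB>3", and nothing in the line itself says where the original key ended.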

Re: SequenceFile (Text,Text) becomes plain text

2007-02-02 Thread Owen O'Malley
On Feb 2, 2007, at 2:46 PM, Bryan A. P. Pendleton wrote: Note that, unless there are no tab characters in the keys of the output from the first job, there's no way to read the existing output accurately back in. *Sigh* That asymmetry in Text{In,Out}putFormat has bothered me for a while

Re: SequenceFile (Text,Text) becomes plain text

2007-02-02 Thread Bryan A. P. Pendleton
Yes, it would be nice to fix that at some point. Possibly a shadow file that keeps track of the offset of each key/value in the file (probably using Vint-encoded difference-from-last-value). The existing output would be preserved, but someone reading the file could use such a cheat sheet to
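
The offset-tracking idea in this message might look roughly like the sketch below. It is only an illustration under stated assumptions: record start offsets are ascending, each one is stored as the difference from the previous offset, and the difference is written as a generic base-128 variable-length integer. This encoding is not claimed to match Hadoop's own VInt byte layout, and the names (OffsetIndexSketch, encode, decode) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

public class OffsetIndexSketch {
    // Write a non-negative long as a base-128 varint:
    // 7 data bits per byte, high bit set when more bytes follow.
    static void writeVarLong(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    // Read one varint written by writeVarLong.
    static long readVarLong(ByteArrayInputStream in) {
        long result = 0;
        int shift = 0;
        while (true) {
            int b = in.read();
            if (b < 0) {
                throw new IllegalStateException("truncated varint");
            }
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
    }

    // Encode ascending record offsets as varint-encoded differences from the previous offset.
    static byte[] encode(long[] offsets) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long prev = 0;
        for (long off : offsets) {
            writeVarLong(out, off - prev);
            prev = off;
        }
        return out.toByteArray();
    }

    // Rebuild the absolute offsets by summing the stored differences.
    static long[] decode(byte[] data) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        List<Long> offsets = new ArrayList<>();
        long prev = 0;
        while (in.available() > 0) {
            prev += readVarLong(in);
            offsets.add(prev);
        }
        long[] result = new long[offsets.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = offsets.get(i);
        }
        return result;
    }
}
```

Because consecutive records tend to start close together, most differences are small and fit in one or two bytes, which is the appeal of storing differences rather than absolute offsets.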