Re: question on file, inputformats and outputformats

Ted Dunning Mon, 17 Dec 2007 18:21:29 -0800


Part of your problem is that you appear to be using a TextInputFormat (the
default input format).  The TIF produces keys that are LongWritable and
values that are Text.

Other input formats produce different types.
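
To make the TextInputFormat case concrete: the LongWritable key is the byte offset at which a line starts in the file, and the Text value is the line itself. Here is a minimal sketch of that pairing in plain Java with no Hadoop dependency. The names OffsetRecords, Record, and records are hypothetical, a long and a String stand in for LongWritable and Text, and the real record reader also handles things like split boundaries that are ignored here.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class OffsetRecords {
    /** One (byte offset, line) record, standing in for a (LongWritable, Text) pair. */
    public static final class Record {
        public final long offset;
        public final String line;

        Record(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }
    }

    /** Splits data on '\n' and pairs each line with the byte offset where it starts. */
    public static List<Record> records(byte[] data) {
        List<Record> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= data.length; i++) {
            boolean atEnd = (i == data.length);
            // Emit a record at each newline, and at end of data if a partial line remains.
            if (atEnd ? start < i : data[i] == '\n') {
                String line = new String(data, start, i - start, StandardCharsets.UTF_8);
                out.add(new Record(start, line));
                start = i + 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = "a\tapple\nb\tbanana\n".getBytes(StandardCharsets.UTF_8);
        for (Record r : records(data)) {
            System.out.println(r.offset + " -> " + r.line);
        }
    }
}
```

The point for a map function is that the key it receives is a position, not part of your data, so any key you care about has to be parsed out of the value.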

With recent versions of hadoop, classes that extend InputFormatBase can (and
I think should) use generics to declare the key and value types they produce.
Similarly, classes extending MapReduceBase can declare both their input and
output types, and OutputFormat implementations can declare the types they
write.

I have added more specific comments in-line.

On 12/17/07 5:40 PM, "Jim the Standing Bear" <[EMAIL PROTECTED]> wrote:

> 1. Pass in a string to my hadoop program, and it will write this
> single key-value pair to a file on the fly.

How is your string a key-value pair?

Assuming that you have something as simple as tab-delimited text, you may
not need to do anything at all other than just copy this data into hadoop.

> 2. The first job will read from this file, do some processing, and
> write more key-value pairs to other files (the same format as the file
> in step 1). Subsequent jobs will read from those files generated by
> the first job. This will continue in an iterative manner until some
> terminal condition has been reached.

Can you be more specific?

Let's assume that you are reading tab-delimited data.  You should set the
input format:

       step1.setInputFormat(TextInputFormat.class);

Then, since the output of your map will have a string key and value, you
should tell the system this:

       step1.setOutputKeyClass(Text.class);
       step1.setOutputValueClass(Text.class);

Note that the signature on your map function should be:

    public static class JoinMap extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
            ...

        public void map(LongWritable k, Text input,
                        OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {
            // Text has no split() method, so convert to a String first.
            String[] parts = input.toString().split("\t");

            Text key, result;
                ...
            output.collect(key, result);
        }
    }

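And your reduce should look something like this, where Mumble stands for
whatever Writable type you want to emit.  If Mumble is not Text, the job's
output value class has to be Mumble, and the map output value class should
be set separately with setMapOutputValueClass(Text.class):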

    public static class JoinReduce extends MapReduceBase implements
            Reducer<Text, Text, Text, Mumble> {

        public void reduce(Text k, Iterator<Text> values,
                           OutputCollector<Text, Mumble> output,
                           Reporter reporter) throws IOException {
            Text key;
            Mumble result;
                ....
            output.collect(key, result);
        }
    }
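
The way the types line up between the two signatures above can be tried out without a cluster. This is only a sketch in plain Java: String stands in for Text, Integer stands in for the Mumble placeholder, and the class TypeFlow with its map, reduce, and run methods is hypothetical. Counting the values per key is an arbitrary reduce chosen for illustration, not something from the original question.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TypeFlow {
    /** Map step: (offset, line) -> (key, value), splitting the line at the first tab. */
    static Map.Entry<String, String> map(long offset, String line) {
        String[] parts = line.split("\t", 2);
        String value = parts.length > 1 ? parts[1] : "";
        return new AbstractMap.SimpleEntry<>(parts[0], value);
    }

    /** Reduce step: (key, all values for that key) -> one result of a different type. */
    static Integer reduce(String key, List<String> values) {
        return values.size();
    }

    /** Runs map over every line, groups the map output by key, then reduces each group. */
    public static Map<String, Integer> run(List<String> lines) {
        Map<String, List<String>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            Map.Entry<String, String> kv = map(offset, line);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            // Advance by the line's byte length plus one for the newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a\t1", "b\t2", "a\t3")));
    }
}
```

The map output types (String, String) are exactly the reduce input types, while the reduce output value type is free to differ. That is the same constraint the Mapper and Reducer type parameters express.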

> KeyValueTextInputFormat looks promising

This could work, depending on what data you have for input.  Set the
separator byte to be whatever separates your key from your value and off you
go.
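
As far as I know, KeyValueTextInputFormat cuts each line at the first occurrence of the separator (a tab by default) and treats a line with no separator as a key with an empty value. A rough sketch of that rule in plain Java follows. The class and method names are hypothetical, and it works on a String and a char where the real reader works on bytes.

```java
public class KeyValueSplit {
    /**
     * Splits a line into {key, value} at the first occurrence of separator.
     * A line without the separator becomes the key, with an empty value.
     */
    public static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = split("user42\tsome\tvalue", '\t');
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```

Only the first separator matters, so a value may itself contain tabs. With this input format both the key and the value arrive in the map as Text, so the Mapper would be declared with Text rather than LongWritable as its input key type.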
