Hi Ted,

Yes, I got quite confused and picked TextInputFormat because I thought
it would be easy to understand.

To be more specific on what I am trying to do:

I pass in the path to a directory (say "/usr/mydir/bigtree").  The
code writes this to a file:  DIR <TAB> /usr/mydir/bigtree

The job will read data from the file; if it gets a DIR, it will walk
into that directory, list everything it contains, and write the
contents to another file.  Sub-directories will have "DIR" as their
keys, and files will have "FILE".  The same job configuration then
reads the new data file and does the same thing again and again, until
there are no more directories to be walked.  So in the end, there
should be a file listing all the files under a directory (not
necessarily directly under it).
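For reference, here is a minimal local sketch of that iterative walk, using only java.io/java.nio (no Hadoop); each call to pass() plays the role of one job run over the frontier file, and the temp-directory layout is just made up for illustration:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class WalkSim {
    // One "job" pass: read DIR/FILE records from the current frontier,
    // expand each DIR into its children, and return the next frontier.
    // FILE records are collected into allFiles (the final output).
    static List<String> pass(List<String> frontier, List<String> allFiles) {
        List<String> next = new ArrayList<>();
        for (String record : frontier) {
            String[] kv = record.split("\t", 2);   // key <TAB> value, as in the seed file
            if (kv[0].equals("DIR")) {
                File[] children = new File(kv[1]).listFiles();
                if (children == null) continue;
                for (File c : children) {
                    next.add((c.isDirectory() ? "DIR" : "FILE") + "\t" + c.getPath());
                }
            } else {
                allFiles.add(kv[1]);
            }
        }
        return next;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny tree for the demo: root/a.txt and root/sub/b.txt
        Path root = Files.createTempDirectory("bigtree");
        Files.createFile(root.resolve("a.txt"));
        Path sub = Files.createDirectory(root.resolve("sub"));
        Files.createFile(sub.resolve("b.txt"));

        List<String> frontier = new ArrayList<>();
        frontier.add("DIR\t" + root);              // the seed record
        List<String> allFiles = new ArrayList<>();
        while (!frontier.isEmpty()) {              // terminal condition: frontier exhausted
            frontier = pass(frontier, allFiles);
        }
        System.out.println(allFiles.size());       // both files found, wherever nested
    }
}
```

In the real job the frontier would of course be a file on HDFS rather than an in-memory list, but the termination logic is the same: stop when a pass emits no new records.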

Now that you have told me about the generics, I am hoping the reason
the sequence file didn't work for me is that I didn't set the correct
types.  I shall try that again.

With KeyValueTextInputFormat, the problem is not reading it - I know
how to set the separator byte and all that... my problem is with
creating the very first file - I simply don't know how.  I can use
SequenceFile.Writer to write the key and value, but the resulting file
contains a header, some funny-looking separator and sync bytes.  If I
simply want a file containing clean Key<Text>\tValue<Text> lines, I
don't know what kind of Writer to use to create it.  Do you know of a
way?  Thanks.
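(A minimal sketch of one possible answer, assuming a plain character stream is acceptable: no special Writer class is needed for a clean key<TAB>value text file - ordinary java.io works. The temp-file path below is only for illustration; to write the seed file directly into HDFS one would obtain the output stream from Hadoop's FileSystem.create() instead of FileWriter, but the bytes written are the same.)

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class SeedFile {
    public static void main(String[] args) throws IOException {
        File seed = File.createTempFile("seed", ".txt");   // stand-in path for the demo
        // Write one clean key<TAB>value record per line -- no header,
        // no sync bytes, exactly what KeyValueTextInputFormat can read back.
        try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(seed)))) {
            out.println("DIR\t/usr/mydir/bigtree");
        }
        // Read it back to show the file contains only the plain record.
        try (BufferedReader in = new BufferedReader(new FileReader(seed))) {
            System.out.println(in.readLine());
        }
    }
}
```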

-- Jim

On Dec 17, 2007 9:01 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
>
> Part of your problem is that you appear to be using a TextInputFormat (the
> default input format).  The TIF produces keys that are LongWritable and
> values that are Text.
>
> Other input formats produce different types.
>
> With recent versions of hadoop, classes that extend InputFormatBase can (and
> I think should) use templates to describe their output types.  Similarly,
> classes extending MapReduceBase and OutputFormat can specify input/output
> classes and output classes respectively.
>
> I have added more specific comments in-line.
>
> On 12/17/07 5:40 PM, "Jim the Standing Bear" <[EMAIL PROTECTED]> wrote:
>
>
> > 1. Pass in a string to my hadoop program, and it will write this
> > single key-value pair to a file on the fly.
>
> How is your string a key-value pair?
>
> Assuming that you have something as simple as tab-delimited text, you may
> not need to do anything at all other than just copy this data into hadoop.
>
> > 2. The first job will read from this file, do some processing, and
> > write more key-value pairs to other files (the same format as the file
> > in step 1). Subsequent jobs will read from those files generated by
> > the first job. This will continue in an iterative manner until some
> > terminal condition has been reached.
>
> Can you be more specific?
>
> Let's assume that you are reading tab-delimited data.  You should set the
> input format:
>
>         conf.setInputFormat(TextInputFormat.class);
>
> Then, since the output of your map will have a string key and value, you
> should tell the system this:
>
>        step1.setOutputKeyClass(Text.class);
>        step1.setOutputValueClass(Text.class);
>
> Note that the signature on your map function should be:
>
>    public static class JoinMap extends MapReduceBase
>     implements Mapper<LongWritable, Text, Text, Text> {
>             ...
>
>         public void map(LongWritable k, Text input,
>                         OutputCollector<Text, Text> output,
>                         Reporter reporter) throws IOException {
>             // Text has no split(); convert to String first
>             String[] parts = input.toString().split("\t");
>
>             Text key, result;
>                 ...
>             output.collect(key, result);
>         }
>     }
>
> And your reduce should look something like this:
>
>     public static class JoinReduce extends MapReduceBase implements
>             Reducer<Text, Text, Text, Mumble> {
>
>         public void reduce(Text k, Iterator<Text> values,
>                            OutputCollector<Text, Mumble> output,
>                            Reporter reporter) throws IOException {
>             Text key;
>             Mumble result;
>                 ....
>             output.collect(key, result);
>         }
>     }
>
>
> > KeyValueTextInputFormat looks promising
>
> This could work, depending on what data you have for input.  Set the
> separator byte to be whatever separates your key from your value and off you
> go.
>
>
>
>



-- 
--------------------------------------
Standing Bear Has Spoken
--------------------------------------
