Thanks for the responses!

James said:
> do you know the maximum number of keys?

No.  I suppose I could compute the number of keys in a separate pass
but that seems pretty icky.

Jason said:
> Where fs is a FileSystem object available via the getFileSystem(conf) method 
> of Path.
>  FSDataOutputStream out = fs.create( destinationFile );
> then write to your out as normal then close it at the end of your reduce body.

This seems very straightforward, but it works outside of the typical
M/R framework: the files created are essentially side effects, not
the "actual" output of the job.  That doesn't seem very clean to me,
but perhaps this is my somewhat shaky understanding of the paradigm
showing through.
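For reference, here is a minimal sketch of what I understand Jason to
be suggesting, inside a reduce body (the 'outputDir' Path and the
saved 'conf' are placeholders of mine, not from his mail):

    // Assumes 'conf' was saved in configure(JobConf conf) and that
    // 'outputDir' is some base Path I choose; one file per key.
    Path destinationFile = new Path(outputDir, key.toString());
    FileSystem fs = destinationFile.getFileSystem(conf);
    FSDataOutputStream out = fs.create(destinationFile);
    try {
        while (values.hasNext()) {
            out.writeBytes(values.next().toString() + "\n");
        }
    } finally {
        out.close();  // close at the end of the reduce body
    }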

Alejandro said:
> Take a look at the MultipleOutputFormat class or MultipleOutputs (in SVN tip)

I'm muddling through both
http://issues.apache.org/jira/browse/HADOOP-2906 and
http://issues.apache.org/jira/browse/HADOOP-3149 trying to make sense
of these.  I'm a little confused by how this works, but it looks like
I can define a number of named outputs, each with its own output
format, and I can also define some of them as "multi", meaning that I
can write to different "targets" (like separate files).  Is this
correct?
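If I have that right, the distinction would look something like this
(the "seq" name here is made up by me for illustration):

    // A plain named output: one extra output with its own format.
    MultipleOutputs.addNamedOutput(job, "seq",
        SequenceFileOutputFormat.class, Text.class, Text.class);

    // A "multi" named output: each record can also name a target,
    // which becomes part of the output file name.
    MultipleOutputs.addMultiNamedOutput(job, "text",
        TextOutputFormat.class, Text.class, Text.class);

    // In reduce:
    //   mos.getCollector("seq", reporter)           // plain
    //   mos.getCollector("text", target, reporter)  // multi, per-record target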

My current test looks like this (note that I am very new to this, so
if I am doing something dumb, please point it out so I can learn):

setup:

    job.addInputPath(new Path(segment, Content.DIR_NAME));

    job.setInputFormat(SequenceFileInputFormat.class);
    job.setMapperClass(InputCompatMapper.class);
    job.setReducerClass(TestMapreduce.class);

    job.setOutputPath(output);
    job.setOutputFormat(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NutchWritable.class);

    MultipleOutputs.addMultiNamedOutput(job, "text",
        TextOutputFormat.class, Text.class, Text.class);

reduce:

    public void reduce(WritableComparable key, Iterator<NutchWritable> values,
                       OutputCollector<WritableComparable, Writable> output,
                       Reporter reporter) throws IOException {
        ...
        mos.getCollector("text", sha, reporter)
            .collect(null, new Text(data.toString()));
    }

(mos is a MultipleOutputs set in configure(...), and sha is a String)
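For completeness, the plumbing around that looks like this (as I
understand it, the MultipleOutputs instance must also be closed in
close(), or output may be lost):

    private MultipleOutputs mos;

    public void configure(JobConf conf) {
        mos = new MultipleOutputs(conf);
    }

    public void close() throws IOException {
        mos.close();  // flushes and closes all the named output collectors
    }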

This mostly has the desired effect, populating my output directory
with files named like
'text_0fe41fb5598a86b6b9f9a7181722a20cba6-r-00000', along with an
empty 'part-00000' file.

A couple of questions:

 - I needed to pass 'null' to the collect method to keep the key from
being written to the file.  These files are meant to be consumable
chunks of content, so I want to control exactly what goes into them.
Does this seem normal, or am I missing something?  Is there a
downside to passing null here?

 - What is the 'part-00000' file for?  I have seen it in other places
in the DFS, but it seems extraneous here.  It's not super critical,
but if I can make it go away, that would be great.

 - What is the purpose of the '-r-00000' suffix?  Perhaps it is to
help with collisions?  It seems strange that I can't just say "the
output file should be called X" and have an output file called X
appear.  I want this process to be as robust as possible, but also as
clean as possible.  If, say, I could run this job and have it output
a bunch of <name>.<ext> files directly to an S3 native filesystem,
that would be swell, though I can certainly make this happen in a
multi-step process.  Anybody have more info on this or other ideas?

Thanks so much!  This community is really great and helpful!

-lincoln

--
lincolnritter.com



On Wed, Jul 23, 2008 at 9:07 AM, James Moore <[EMAIL PROTECTED]> wrote:
> On Tue, Jul 22, 2008 at 5:04 PM, Lincoln Ritter
> <[EMAIL PROTECTED]> wrote:
>> Greetings,
>>
>> I would like to write one file per key in the reduce (or map) phase of a
>> mapreduce job.  I have looked at the documentation for
>> FileOutputFormat and MultipleTextOutputFormat but am a bit unclear on
>> how to use it/them.  Can anybody give me a quick pointer?
>
> One way to cheat for the reduce part of this - do you know the maximum
> number of keys?  If so, I think you should be able to just set the
> number of reducers to >= the maximum number of keys.
>
> --
> James Moore | [EMAIL PROTECTED]
> Ruby and Ruby on Rails consulting
> blog.restphone.com
>
