Thanks for the responses!

James said:

> do you know the maximum number of keys?
No. I suppose I could compute the number of keys in a separate pass, but
that seems pretty icky.

Jason said:

> Where fs is a FileSystem object available via the getFileSystem(conf)
> method of Path.
>
> FSDataOutputStream out = fs.create( destinationFile );
>
> then write to your out as normal then close it at the end of your
> reduce body.

This seems very straightforward, but it also seems to work outside of the
typical M/R framework; the files created are essentially side effects and
not the "actual" output of the job. This doesn't seem very clean to me, but
perhaps that is my somewhat shaky understanding of the paradigm showing
through.

Alejandro said:

> Take a look at the MultipleOutputFormat class or MultipleOutputs (in SVN tip)

I'm muddling through both http://issues.apache.org/jira/browse/HADOOP-2906
and http://issues.apache.org/jira/browse/HADOOP-3149 trying to make sense of
these. I'm a little confused by how this works, but it looks like I can
define a number of named outputs, each of which can have a different output
format, and I can also mark some of these as "multi", meaning that I can
write to different "targets" (like files). Is this correct?

My current test looks like this (note that I am very new to this, so if I am
doing something dumb, please point it out so I can learn):

setup:

  job.addInputPath(new Path(segment, Content.DIR_NAME));
  job.setInputFormat(SequenceFileInputFormat.class);
  job.setMapperClass(InputCompatMapper.class);
  job.setReducerClass(TestMapreduce.class);
  job.setOutputPath(output);
  job.setOutputFormat(TextOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(NutchWritable.class);

  MultipleOutputs.addMultiNamedOutput(job, "text", TextOutputFormat.class,
      Text.class, Text.class);

reduce:

  public void reduce(WritableComparable key, Iterator<NutchWritable> values,
      OutputCollector<WritableComparable, Writable> output, Reporter reporter)
      throws IOException {
    ...
    mos.getCollector("text", sha, reporter).collect(null, new Text(data.toString()));
  }

(mos is a MultipleOutputs set up in configure(...), and sha is a String.)

This seems to have mostly the desired effect, populating my output directory
with files named like 'text_0fe41fb5598a86b6b9f9a7181722a20cba6-r-00000', as
well as an empty 'part-00000' file.

A couple of questions:

- I needed to pass 'null' to the collect method so as not to write the key
  to the file. These files are meant to be consumable chunks of content, so
  I want to control exactly what goes into them. Does this seem normal, or
  am I missing something? Is there a downside to passing null here?

- What is the 'part-00000' file for? I have seen this in other places in the
  dfs, but it seems extraneous here. It's not super critical, but if I can
  make it go away, that would be great.

- What is the purpose of the '-r-00000' suffix? Perhaps it is to help with
  collisions? I guess it seems strange that I can't just say "the output
  file should be called X" and have an output file called X appear.

I certainly want this process to be as robust as possible, but I would also
like to make it as clean as possible. If, say, I could run this job and have
it output a bunch of <name>.<ext> files to an S3native fs directly, that
would be swell, though certainly I can make this happen in a multi-step
process.

Anybody have more info on this or other ideas?

Thanks so much! This community is really great and helpful!

-lincoln

--
lincolnritter.com

On Wed, Jul 23, 2008 at 9:07 AM, James Moore <[EMAIL PROTECTED]> wrote:
> On Tue, Jul 22, 2008 at 5:04 PM, Lincoln Ritter
> <[EMAIL PROTECTED]> wrote:
>> Greetings,
>>
>> I would like to write one file per key in the reduce (or map) phase of a
>> mapreduce job. I have looked at the documentation for
>> FileOutputFormat and MultipleTextOutputFormat but am a bit unclear on
>> how to use it/them. Can anybody give me a quick pointer?
>
> One way to cheat for the reduce part of this - do you know the maximum
> number of keys? If so, I think you should be able to just set the
> number of reducers to >= the maximum number of keys.
>
> --
> James Moore | [EMAIL PROTECTED]
> Ruby and Ruby on Rails consulting
> blog.restphone.com
>
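Jason's side-effect approach above comes down to keeping one open output
stream per key and closing them all at the end of the reduce body. A minimal
sketch of that pattern in plain Java, with in-memory StringWriters standing
in for the streams that fs.create(destinationFile) would return (the class
and method names here are made up for illustration):

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;

public class OneFilePerKey {
    // Lazily opens one writer per key and appends each value to it.
    // In a real reducer the map value would be an FSDataOutputStream
    // obtained from FileSystem.create(...), closed when reduce finishes.
    static Map<String, StringWriter> writeGrouped(String[][] records) {
        Map<String, StringWriter> open = new LinkedHashMap<>();
        for (String[] rec : records) {
            String key = rec[0];
            String value = rec[1];
            StringWriter out = open.computeIfAbsent(key, k -> new StringWriter());
            out.write(value);
            out.write('\n');
        }
        return open;
    }

    public static void main(String[] args) {
        Map<String, StringWriter> files = writeGrouped(new String[][] {
            { "sha1", "first chunk" },
            { "sha2", "other chunk" },
            { "sha1", "second chunk" },
        });
        // each key's "file" holds only that key's values, in order
        System.out.println(files.get("sha1"));
    }
}
```

The upside is full control over file names and contents; the downside, as
noted above, is that the files are side effects the framework knows nothing
about.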

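James's reducer-count cheat hinges on how a key is mapped to a reduce
partition: reducer i writes the file part-0000i, and with at least as many
reducers as keys each key can end up in its own part file. A sketch assuming
Hadoop's default HashPartitioner formula (the helper name is hypothetical;
note that hash collisions can still put two keys in the same partition
unless a custom partitioner is used):

```java
public class PartFiles {
    // Mirrors HashPartitioner: partition = (hashCode & Integer.MAX_VALUE) % numReduceTasks.
    // The masking keeps the partition non-negative for negative hash codes.
    static String partFileFor(String key, int numReduceTasks) {
        int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        return String.format("part-%05d", partition);
    }

    public static void main(String[] args) {
        // With 10 reducers, each key hashes to one of part-00000 .. part-00009.
        System.out.println(partFileFor("sha1", 10));
        System.out.println(partFileFor("sha2", 10));
    }
}
```

With a single reducer everything lands in part-00000, which is why that file
shows up so often in the dfs.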