Hi Harsh,

Cool, thanks for the details. For anyone interested, with your tip and description I was able to find an example in the book "Hadoop in Action" (Chapter 7, p. 168).

Another question, though: it doesn't look like MultipleOutputs will let me control the filename in a per-key (per-map) manner. Basically, if my map receives a key of "mykey", I want my file to be "mykey-someotherstuff.foo" (this is a binary file). Am I right about this?

Thanks,

Tom

On Tue, Jul 26, 2011 at 1:34 AM, Harsh J <[email protected]> wrote:
> Tom,
>
> What I meant to say was that doing this is well supported with
> existing API/libraries itself:
>
> - The class MultipleOutputs supports providing a filename for an
>   output. See MultipleOutputs.addNamedOutput usage [1].
> - The type 'NullWritable' is a special writable that doesn't do
>   anything. So if it's configured into the above filename addition as a
>   key-type, and you pass NullWritable.get() as the key in every write
>   operation, you will end up just writing the value part of (key,
>   value).
> - This way you do not have to write a custom OutputFormat for your use-case.
>
> [1] - http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
> (Also available for the new API, depending on which
> version/distribution of Hadoop you are on)
>
> On Tue, Jul 26, 2011 at 3:36 AM, Tom Melendez <[email protected]> wrote:
>> Hi Harsh,
>>
>> Thanks for the response. Unfortunately, I'm not following your response.
>> :-)
>>
>> Could you elaborate a bit?
>>
>> Thanks,
>>
>> Tom
>>
>> On Mon, Jul 25, 2011 at 2:10 PM, Harsh J <[email protected]> wrote:
>>> You can use MultipleOutputs (or MultipleTextOutputFormat for direct
>>> key-file mapping, but I'd still prefer the stable MultipleOutputs).
>>> Your sinking Key can be of NullWritable type, and you can keep passing
>>> an instance of NullWritable.get() to it in every cycle. This would
>>> write just the value, while the filenames are added/sourced from the
>>> key inside the mapper code.
>>>
>>> This, if you are not comfortable writing your own code and maintaining
>>> it, I s'pose. Your approach is correct as well, if the question was
>>> specifically that.
>>>
>>> On Tue, Jul 26, 2011 at 1:55 AM, Tom Melendez <[email protected]> wrote:
>>>> Hi Folks,
>>>>
>>>> Just doing a sanity check here.
>>>>
>>>> I have a map-only job, which produces a filename for a key and data as
>>>> a value. I want to write the value (data) into the key (filename) in
>>>> the path specified when I run the job.
>>>>
>>>> The value (data) doesn't need any formatting, I can just write it to
>>>> HDFS without modification.
>>>>
>>>> So, looking at this link (the Output Formats section):
>>>>
>>>> http://developer.yahoo.com/hadoop/tutorial/module5.html
>>>>
>>>> Looks like I want to:
>>>> - create a new output format
>>>> - override write, tell it not to call writekey as I don't want that written
>>>> - new getRecordWriter method that uses the key as the filename and
>>>>   calls my outputformat
>>>>
>>>> Sound reasonable?
>>>>
>>>> Thanks,
>>>>
>>>> Tom
>>>>
>>>> --
>>>> ===================
>>>> Skybox is hiring.
>>>> http://www.skyboximaging.com/careers/jobs
>>>
>>> --
>>> Harsh J
>
> --
> Harsh J
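
For readers following the thread, the idea under discussion (the key only chooses the file name, and only the value is written, unmodified) can be sketched outside Hadoop as below. This is a plain java.nio stand-in on the local filesystem, not the MultipleOutputs or OutputFormat API; the class name KeyNamedValueWriter and the suffix parameter are hypothetical and exist only for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Local-filesystem stand-in (NOT a Hadoop class): the key only picks the
 * file name, and the value bytes are written without modification.
 */
class KeyNamedValueWriter {
    private final Path outputDir; // plays the role of the job's output path
    private final String suffix;  // hypothetical, e.g. "-someotherstuff.foo"

    KeyNamedValueWriter(Path outputDir, String suffix) {
        this.outputDir = outputDir.toAbsolutePath().normalize();
        this.suffix = suffix;
    }

    /** Writes value to outputDir/(key + suffix); the key itself is never written. */
    Path write(String key, byte[] value) {
        Path target = outputDir.resolve(key + suffix).normalize();
        // Reject keys such as "../x" that would escape the output directory.
        if (!target.startsWith(outputDir) || target.equals(outputDir)) {
            throw new IllegalArgumentException("unsafe key: " + key);
        }
        try {
            Files.createDirectories(target.getParent());
            return Files.write(target, value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("key-named-output");
        KeyNamedValueWriter writer = new KeyNamedValueWriter(dir, "-someotherstuff.foo");
        Path written = writer.write("mykey", new byte[] {0x01, 0x02, 0x03});
        System.out.println(written.getFileName() + " holds " + Files.size(written) + " bytes");
    }
}
```

In an actual job the same split of responsibilities would live in a custom OutputFormat/RecordWriter, or be delegated to MultipleOutputs with a NullWritable key as described in the thread; how much control either gives over the final file name depends on the Hadoop version in use.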
