Re: Custom FileOutputFormat / RecordWriter

Tom Melendez Tue, 26 Jul 2011 08:52:04 -0700

Hi Harsh,

Cool, thanks for the details.  For anyone interested, with your tip
and description I was able to find an example inside the "Hadoop in
Action" (Chapter 7, p168) book.


Another question, though: it doesn't look like MultipleOutputs will
let me control the filename on a per-key (per map) basis.  Basically,
if my map receives a key of "mykey", I want my output file to be
"mykey-someotherstuff.foo" (this is a binary file).  Am I right that
MultipleOutputs can't do this?

Thanks,

Tom
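
For readers following the thread, here is a minimal plain-Java sketch of
the behaviour being asked about: one file per key, named from the key,
holding only the raw value bytes. It deliberately uses no Hadoop classes,
so it is not the MultipleOutputs or RecordWriter API; the class name
PerKeyFileSketch, its methods, and the suffix argument are all
hypothetical names made up for illustration. In a real job the same logic
would presumably sit inside a custom RecordWriter or behind whatever
per-key naming hook the chosen output class offers.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Plain-Java sketch (no Hadoop classes): one output file per key,
 * containing only the value bytes. All names here are hypothetical.
 */
final class PerKeyFileSketch {

    /** Builds "<key><suffix>", e.g. "mykey-someotherstuff.foo". */
    static String fileNameFor(String key, String suffix) {
        // Reject separators and dot names so a key cannot escape the output directory.
        if (key.isEmpty() || key.contains("/") || key.contains("\\")
                || key.equals(".") || key.equals("..")) {
            throw new IllegalArgumentException("unsafe key: " + key);
        }
        return key + suffix;
    }

    /** Appends the raw value to the file derived from the key; the key itself is never written. */
    static Path write(Path outputDir, String key, byte[] value, String suffix) throws IOException {
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve(fileNameFor(key, suffix));
        Files.write(target, value, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Demo: write three bytes for key "mykey" into a temporary directory.
        Path dir = Files.createTempDirectory("per-key-output");
        Path written = write(dir, "mykey", new byte[] {0x01, 0x02, 0x03}, "-someotherstuff.foo");
        System.out.println(written.getFileName() + " holds " + Files.size(written) + " bytes");
    }
}
```

The sketch appends rather than overwrites, on the assumption that several
records may share a key; if each key is seen only once, a plain create
would do just as well.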

On Tue, Jul 26, 2011 at 1:34 AM, Harsh J <[email protected]> wrote:
> Tom,
>
> What I meant to say was that doing this is well supported with
> existing API/libraries itself:
>
> - The class MultipleOutputs supports providing a filename for an
> output. See MultipleOutputs.addNamedOutput usage [1].
> - The type 'NullWritable' is a special writable that doesn't do
> anything. So if it's configured as the key type of the named output
> above, and you pass NullWritable.get() as the key in every write
> operation, you will end up writing just the value part of (key,
> value).
> - This way you do not have to write a custom OutputFormat for your use-case.
>
> [1] - 
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
> (Also available for the new API, depending on which
> version/distribution of Hadoop you are on)
>
> On Tue, Jul 26, 2011 at 3:36 AM, Tom Melendez <[email protected]> wrote:
>> Hi Harsh,
>>
>> Thanks for the response.  Unfortunately, I'm not following it.  :-)
>>
>> Could you elaborate a bit?
>>
>> Thanks,
>>
>> Tom
>>
>> On Mon, Jul 25, 2011 at 2:10 PM, Harsh J <[email protected]> wrote:
>>> You can use MultipleOutputs (or MultipleTextOutputFormat for direct
>>> key-file mapping, but I'd still prefer the stable MultipleOutputs).
>>> Your output key can be of NullWritable type, and you can keep passing
>>> NullWritable.get() as the key in every write. This would
>>> write just the value, while the filenames are derived from the
>>> key inside the mapper code.
>>>
>>> That is, if you would rather not write and maintain your own
>>> code, I s'pose. Your approach is correct as well, if that was
>>> specifically the question.
>>>
>>> On Tue, Jul 26, 2011 at 1:55 AM, Tom Melendez <[email protected]> wrote:
>>>> Hi Folks,
>>>>
>>>> Just doing a sanity check here.
>>>>
>>>> I have a map-only job, which produces a filename for a key and data as
>>>> a value.  I want to write the value (data) into the key (filename) in
>>>> the path specified when I run the job.
>>>>
>>>> The value (data) doesn't need any formatting, I can just write it to
>>>> HDFS without modification.
>>>>
>>>> So, looking at this link (the Output Formats section):
>>>>
>>>> http://developer.yahoo.com/hadoop/tutorial/module5.html
>>>>
>>>> Looks like I want to:
>>>> - create a new output format
>>>> - override write, tell it not to call writekey as I don't want that written
>>>> - new getRecordWriter method that uses the key as the filename and
>>>> calls my outputformat
>>>>
>>>> Sound reasonable?
>>>>
>>>> Thanks,
>>>>
>>>> Tom
>>>>
>>>> --
>>>> ===================
>>>> Skybox is hiring.
>>>> http://www.skyboximaging.com/careers/jobs
>>>>
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>>
>>
>> --
>> ===================
>> Skybox is hiring.
>> http://www.skyboximaging.com/careers/jobs
>>
>
>
>
> --
> Harsh J
>



-- 
===================
Skybox is hiring.
http://www.skyboximaging.com/careers/jobs
