Re: tab-delimited output

Alan Miller Thu, 13 May 2010 14:20:10 -0700

Thanks Alex,

For question 2, I was able to implement a Custom OutputFormat that
allows me to write some header lines to a file then write multiple
tab-delimited values per line like I wanted.


I had to "extend FileOutputFormat" and implement my own
write(), close() and getRecordWriter().

The 1st question is still open for me though: how do I separate reducer
outputs based on a substring of the reducer's key?
In my Driver class I now use
  job.setOutputFormatClass(MyOutputFormat.class);
so I can't use MultipleOutputs to dissect the outputs.

Is there a way to make MyOutputFormat work like MultipleOutputs?

The getRecordWriter calls job.getConfiguration(), so could I do something
like:

  set a new filename in my reduce() via conf.set("fileprefix","2010-05-01_day");
  read the new filename in getRecordWriter() via conf.get("fileprefix");
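
A minimal sketch of one way to split output by the date part of the key, assuming keys of the form host_YYYY-MM-DD as in the reducer input quoted further down. Instead of passing the file name through the configuration, the writer derives the target from each key and opens one sink per date on first use. DateRoutingWriter, splitKey and contents are hypothetical names, not Hadoop APIs, and a StringWriter stands in for the real output file; in an actual OutputFormat the same routing would presumably sit inside the custom RecordWriter's write().

```java
import java.io.StringWriter;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: routes each record to a per-date
// sink chosen from the key, the way a custom RecordWriter could pick a file.
class DateRoutingWriter {
    // One sink per date; StringWriter stands in for an output file.
    private final Map<String, StringWriter> sinks = new TreeMap<>();

    // Splits "hostX_2010-05-01" at the last underscore into {host, date}.
    static String[] splitKey(String key) {
        int i = key.lastIndexOf('_');
        if (i < 0) {
            throw new IllegalArgumentException("expected host_YYYY-MM-DD, got: " + key);
        }
        return new String[] { key.substring(0, i), key.substring(i + 1) };
    }

    // Writes "host<TAB>value" to the sink for the key's date, opening it on first use.
    void write(String key, String value) {
        String[] parts = splitKey(key);
        sinks.computeIfAbsent(parts[1], d -> new StringWriter())
             .write(parts[0] + "\t" + value + "\n");
    }

    // Returns everything written so far for one date, or "" if nothing was written.
    String contents(String date) {
        StringWriter w = sinks.get(date);
        return w == null ? "" : w.toString();
    }
}
```

Calling write("hostX_2010-05-01", "1.0\t2.0") and then contents("2010-05-01") returns the single line for hostX; records for other dates end up in their own sinks.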


Alan

On 05/13/2010 12:29 AM, Alex Kozlov wrote:

Hi Alan,
Unless you run your job with a single reducer, you will not be able to
do this. Think scalable: you should always add '-r-NNNNN' to the end to
allow for multiple reducers, and you can use a custom partitioner to
make sure each host goes to a single reducer. MultipleOutputs can do
the rest, meaning the 'YYYY-MM-DD' prefix.

Question 2 looks like a simple aggregation job: the key should be the
host name, and you just need to aggregate the values for each
host x YYYY-MM-DD pair and write them into separate
'YYYY-MM-DD-r-NNNNN' files. You can also do a secondary sort to make
sure the YYYY-MM-DD values come in order: this way you do not need to
aggregate them in memory. See Reducer.java
<http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Reducer.html>
for details.
Alex K
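
The partitioning and naming described above can be sketched in plain Java. This is only an illustration under assumptions: HostPartitioning, partitionFor and outputName are hypothetical names, the key is assumed to look like host_YYYY-MM-DD, and the hash-and-modulo arithmetic is what a custom Partitioner's getPartition() could compute, applied here to the host part alone.

```java
import java.util.Locale;

// Hypothetical helper, not a Hadoop class: the arithmetic a custom
// Partitioner could use so that all keys of one host share a reducer.
class HostPartitioning {
    // Uses only the host part of "host_YYYY-MM-DD"; masking with
    // Integer.MAX_VALUE keeps the hash non-negative before the modulo.
    static int partitionFor(String key, int numReducers) {
        int i = key.lastIndexOf('_');
        String host = i < 0 ? key : key.substring(0, i);
        return (host.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Builds a name such as "2010-05-01-r-00003" from a date and a partition number.
    static String outputName(String date, int partition) {
        return String.format(Locale.ROOT, "%s-r-%05d", date, partition);
    }
}
```

Because the date is ignored when partitioning, hostX_2010-05-01 and hostX_2010-05-02 map to the same partition, while the date still selects the file prefix.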
On Wed, May 12, 2010 at 3:04 PM, Alan Miller <[email protected]> wrote:
    Hi Alex,

    The tab isn't the issue (yet). I guess it's really 2 questions I have.
    Using the reducer inputs already mentioned.

    1. How do I generate multiple output files named YYYY-MM-DD.txt
    2. Each file should contain
         a. one line per host
         b. each line with host avg1 avg2 avg3 ....

    Alan


    On 05/12/2010 11:50 PM, Alex Kozlov wrote:
    Hi Alan,

    Is the problem that you want your 'value' vals to be tab
    separated?   This is entirely under control of your reducer.

    Alex K

    On Wed, May 12, 2010 at 2:07 PM, Alan Miller
    <[email protected]> wrote:

        Hi all,

        How can I write tab-delimited output files from my reducer?

        My reducer gets Text/Text key/vals like:

        hostX_2010-05-01 varA=valA1,varB=valB1,varC=valC1
        hostX_2010-05-01 varA=valA2,varB=valB2,varC=valC2
        hostX_2010-05-01 varA=valA3,varB=valB3,varC=valC3
        ...
        hostY_2010-05-01 varA=valA1,varB=valB1,varC=valC1
        hostY_2010-05-01 varA=valA2,varB=valB2,varC=valC2
        hostY_2010-05-01 varA=valA3,varB=valB3,varC=valC3
        ...

        After my reducer calculates the daily averages of varA, varB, varC
        I want to write a tab-delimited file with lines like:

        hostX    varA-Avg    varB-Avg    varC-Avg    ....
        hostY    varA-Avg    varB-Avg    varC-Avg    ....


        Thanks,
        Alan
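
A minimal sketch of the averaging step described in the original question, assuming every value is a comma-separated list of name=number pairs and every number parses as a double. DailyAverages and averageLine are hypothetical names, and the two-decimal formatting is an arbitrary choice; in a real reducer the returned string would be what gets written for the host.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: turns the values seen for one host and day into one
// tab-delimited line of per-variable averages.
class DailyAverages {
    // Returns "host<TAB>avgA<TAB>avgB...", variables in first-seen order.
    static String averageLine(String host, List<String> values) {
        // Maps variable name to {running sum, count}.
        Map<String, double[]> acc = new LinkedHashMap<>();
        for (String value : values) {
            for (String pair : value.split(",")) {
                String[] kv = pair.split("=", 2); // assumes well-formed name=number
                double[] sumCount = acc.computeIfAbsent(kv[0], k -> new double[2]);
                sumCount[0] += Double.parseDouble(kv[1]);
                sumCount[1] += 1;
            }
        }
        StringBuilder line = new StringBuilder(host);
        for (double[] sumCount : acc.values()) {
            line.append('\t')
                .append(String.format(Locale.ROOT, "%.2f", sumCount[0] / sumCount[1]));
        }
        return line.toString();
    }
}
```

For hostX with the values "varA=1,varB=10" and "varA=3,varB=20", averageLine returns hostX, 2.00 and 15.00 separated by tabs.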
