Re: Chaining M/R Jobs

Alex Kozlov Mon, 26 Apr 2010 12:01:00 -0700

You can use MultipleOutputs for this purpose, even though it was not
designed for this and a few people on this list are going to raise an
eyebrow.


Alex K

On Mon, Apr 26, 2010 at 11:39 AM, Xavier Stevens <[email protected]> wrote:

> I don't usually bother renaming the files.  If you know you want all of
> the files, you can just iterate over the files in the previous job's
> output directory and add each one to the distributed cache.  If the data
> is fairly small, you can also set the number of reducers to 1 on the
> previous step so that it writes a single output file.
>
>
> -Xavier
>
>
> -----Original Message-----
> From: Eric Sammer [mailto:[email protected]]
> Sent: Monday, April 26, 2010 11:33 AM
> To: [email protected]
> Subject: Re: Chaining M/R Jobs
>
> The easiest way to do this is to write your job outputs to a known
> place and then use the FileSystem APIs to rename the part-* files to
> what you want them to be.
>
> On Mon, Apr 26, 2010 at 2:22 PM, Tiago Veloso <[email protected]>
> wrote:
> > Hi,
> >
> > I'm trying to find a way to control the output file names. I need this
> > because I have a situation where I need to run a Job and then use its
> > output in the DistributedCache.
> >
> > So far the only way I've seen that makes it possible is rewriting the
> > OutputFormat class, but that seems like a lot of work for such a simple
> > task. Is there any way to do what I'm looking for?
> >
> > Tiago Veloso
> > [email protected]
> >
> >
> >
> >
>
>
>
> --
> Eric Sammer
> phone: +1-917-287-2675
> twitter: esammer
> data: www.cloudera.com
>
>
>
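For reference, the rename approach Eric describes could be sketched
roughly as below. This is only an illustration under stated assumptions:
it uses java.nio.file on a local directory as a stand-in for HDFS, and
PartFileRenamer, renameParts and the "lookup" prefix are hypothetical
names, not Hadoop APIs. On a real cluster the corresponding calls would
be FileSystem.globStatus and FileSystem.rename, after which the files
(renamed, or left as part-* as Xavier suggests) could be registered with
DistributedCache.addCacheFile.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Hadoop: gives the part-* files in a
// job output directory predictable names. Works on the local filesystem
// only; it is a stand-in for the HDFS FileSystem API.
class PartFileRenamer {

    // Renames every regular file matching part-* in outputDir to
    // <prefix>-<suffix> (part-00000 becomes lookup-00000 for prefix
    // "lookup") and returns the new paths in sorted order.
    static List<Path> renameParts(Path outputDir, String prefix) throws IOException {
        List<Path> parts = new ArrayList<>();
        // Collect first and rename afterwards, so the directory is not
        // modified while it is still being iterated.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    parts.add(p);
                }
            }
        }
        List<Path> renamed = new ArrayList<>();
        for (Path part : parts) {
            String name = part.getFileName().toString();
            String suffix = name.substring("part-".length());
            Path target = outputDir.resolve(prefix + "-" + suffix);
            // Files.move fails if the target already exists, which avoids
            // silently overwriting an earlier output.
            renamed.add(Files.move(part, target));
        }
        Collections.sort(renamed);
        return renamed;
    }

    public static void main(String[] args) throws IOException {
        // Simulate a finished job's output directory.
        Path dir = Files.createTempDirectory("job-output");
        Files.createFile(dir.resolve("part-00000"));
        Files.createFile(dir.resolve("part-00001"));
        Files.createFile(dir.resolve("_SUCCESS")); // non-part file, left alone
        for (Path p : renameParts(dir, "lookup")) {
            System.out.println(p.getFileName());
        }
    }
}
```

If the previous step runs with a single reducer there is only one part
file to handle, so the loop degenerates to a single rename.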
