Hi, have you looked at Cascading? It gives you events (http://www.cascading.org/userguide/htmlsingle/#N20ADF) that can call back into your code when one of your MR jobs completes. It is also embeddable inside a daemon, so you can launch a thread that runs the MR job and then writes to the external endpoint. It handles all the Hadoop plumbing for you, so you don't need to worry about tracking dependencies.
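
For example, here is a rough sketch of a completion callback, assuming Cascading's cascading.flow.FlowListener interface; postResults() is a hypothetical placeholder for whatever push to your endpoint you need:

import cascading.flow.Flow;
import cascading.flow.FlowListener;

public class PostOnComplete implements FlowListener {
  public void onStarting(Flow flow) { }

  public void onStopping(Flow flow) { }

  public void onCompleted(Flow flow) {
    // the MR jobs behind this Flow are done; notify the external endpoint
    postResults(flow.getName()); // hypothetical helper, e.g. an HTTP POST
  }

  public boolean onThrowable(Flow flow, Throwable throwable) {
    return false; // we did not handle the failure; let it propagate
  }

  private void postResults(String flowName) {
    // e.g. POST the job's output location to your external service
  }
}

// wire it up before starting the flow:
//   flow.addListener(new PostOnComplete());
//   flow.start(); // returns immediately; the listener fires on completion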
On Fri, Jun 26, 2009 at 4:10 AM, Zhong Wang <wangzhong....@gmail.com> wrote:
> Hi Huy,
>
> On Thu, Jun 25, 2009 at 6:02 PM, Huy Phan <dac...@gmail.com> wrote:
> > I'm wondering if there's any performance killer in this approach. I
> > posted the question to the IRC channel, and someone told me that there
> > may be a bottleneck.
>
> Communication errors while posting your output data may block your
> MapReduce job, so I think it's better to do this after the job is done.
>
> > I wonder if there is any way to spawn a process directly from Hadoop
> > after all the MapReduce tasks finish?
>
> How do you submit your jobs? You can make job submission block by
> calling job.waitForCompletion(true) in your main driver class; then
> the two processes are synchronous.
>
> --
> Zhong Wang
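
To illustrate Zhong's suggestion, a minimal blocking driver might look like the following; this assumes the new-API org.apache.hadoop.mapreduce.Job, and postOutputToEndpoint() is a hypothetical placeholder for your post-processing step:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "my-job");
    job.setJarByClass(Driver.class);
    // set your mapper/reducer and key/value classes here as usual
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // blocks until all map and reduce tasks have finished
    boolean ok = job.waitForCompletion(true);

    if (ok) {
      // only now push the output to the external endpoint
      postOutputToEndpoint(args[1]); // hypothetical helper
    }
    System.exit(ok ? 0 : 1);
  }

  static void postOutputToEndpoint(String outputDir) {
    // e.g. read the part files under outputDir and POST them out
  }
}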