Hi,

My problem is that some of the jobs that read the folder are not under my 
control, e.g., a client submits a Hive job.

I was thinking of something like an mv(source, target, long timeout), which 
would block until the folder is no longer in use or the timeout is reached.
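To make that concrete, here is a minimal sketch of the API shape I have in 
mind (the class, method, and InUseCheck callback names are all mine). Note 
that a plain HDFS rename does not fail just because readers are active, so 
the "in use" test would have to come from something external, like a ZK lock 
or a jobtracker query:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TimedMove {

      // How "the folder is still in use" gets decided is external to HDFS:
      // e.g. a ZooKeeper lock or a jobtracker query (see below in thread).
      public interface InUseCheck {
        boolean inUse(Path p) throws Exception;
      }

      // Waits until check.inUse(src) is false, then renames src to target.
      // Gives up after timeoutMs. Returns true iff the rename happened.
      public static boolean moveWithTimeout(FileSystem fs, Path src,
          Path target, long timeoutMs, InUseCheck check) throws Exception {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
          if (!check.inUse(src)) {
            return fs.rename(src, target); // atomic within one HDFS namespace
          }
          Thread.sleep(5000); // poll every 5 seconds
        }
        return false; // timed out while the folder stayed in use
      }
    }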

Is it possible that this problem is not a common one?

> From: ha...@cloudera.com
> Date: Mon, 13 Aug 2012 17:33:02 +0530
> Subject: Re: Locks in M/R framework
> To: mapreduce-user@hadoop.apache.org
> 
> David,
> 
> While ZK can solve this, locking may only make you slower. Let's try to
> keep it simple?
> 
> Have you considered keeping two directories? One where the older data
> is moved to (by the first job, instead of replacing files), for
> consumption by the second job, which triggers by watching this
> directory?
> 
> That is,
> MR Job #1 (the producer) moves existing data to /path/b/timestamp,
> and writes new data to /path/a.
> MR Job #2 (the consumer) uses the latest /path/b/timestamp (or the
> whole set of timestamps available under /path/b at that point) for its
> input, and deletes it afterwards. Hence #2 can monitor this
> directory to trigger itself.
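
If I follow, the handoff would look roughly like this sketch (the /path/a 
and /path/b paths are from your mail; the class and method names are mine):

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TwoDirHandoff {
      static final Path LIVE = new Path("/path/a");    // where job #1 writes
      static final Path HANDOFF = new Path("/path/b"); // where job #2 reads

      // Producer side: before writing fresh data to /path/a, park the
      // previous generation under /path/b/<timestamp> for the consumer.
      public static void publish(FileSystem fs) throws Exception {
        fs.mkdirs(HANDOFF);
        Path snapshot =
            new Path(HANDOFF, Long.toString(System.currentTimeMillis()));
        fs.rename(LIVE, snapshot);
      }

      // Consumer side: pick the newest timestamp directory as the job
      // input; delete it once the job has finished with it.
      public static Path latestSnapshot(FileSystem fs) throws Exception {
        Path latest = null;
        long best = -1;
        for (FileStatus s : fs.listStatus(HANDOFF)) {
          long ts = Long.parseLong(s.getPath().getName());
          if (ts > best) { best = ts; latest = s.getPath(); }
        }
        return latest; // null means nothing to consume yet
      }
    }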
> 
> On Mon, Aug 13, 2012 at 4:22 PM, David Ginzburg <ginz...@hotmail.com> wrote:
> > Hi,
> >
> > I have an HDFS folder and M/R job that periodically updates it by replacing
> > the data with newly generated data.
> >
> > I have a different M/R job that periodically or ad-hoc process the data in
> > the folder.
> >
> > The second job, naturally, fails sometimes, when the data is replaced by
> > newly generated data after the job plan, including the input paths, has
> > already been submitted.
> >
> > Is there an elegant solution ?
> >
> > My current thought is to query the jobtracker for running jobs and go over
> > the input paths in each job's XML, so that the swap blocks until the
> > folder is no longer an input path of any currently executing job.
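
Something like this sketch is what I had in mind for that check, against the 
old mapred API (the class and method names are mine; it assumes read access 
to each job's XML in the jobtracker's system directory, and that jobs record 
their inputs in mapred.input.dir, which FileInputFormat does):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobStatus;
    import org.apache.hadoop.mapred.RunningJob;

    public class InputPathScan {

      // True if any queued/running job lists dir among its input paths.
      public static boolean dirInUse(JobClient jc, Configuration base,
          Path dir) throws Exception {
        for (JobStatus s : jc.jobsToComplete()) {
          RunningJob rj = jc.getJob(s.getJobID());
          if (rj == null) continue;         // finished in the meantime
          Path jobXml = new Path(rj.getJobFile());
          FileSystem fs = jobXml.getFileSystem(base);
          Configuration jobConf = new Configuration(false);
          jobConf.addResource(fs.open(jobXml)); // parse the job's XML
          String inputs = jobConf.get("mapred.input.dir", "");
          if (inputs.contains(dir.toString())) {
            return true;
          }
        }
        return false;
      }
    }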
> >
> >
> >
> >
> 
> 
> 
> -- 
> Harsh J