Hi,
My problem is that some of the jobs that read the folder are not under my control, e.g. a client submits a Hive job.
I was thinking of something like an mv(source, target, long timeout) which would block until the folder is no longer in use or the timeout is reached. Is it possible that this problem is not a common one?

> From: ha...@cloudera.com
> Date: Mon, 13 Aug 2012 17:33:02 +0530
> Subject: Re: Locks in M/R framework
> To: mapreduce-user@hadoop.apache.org
>
> David,
>
> While ZK can solve this, locking may only make you slower. Let's try to
> keep it simple.
>
> Have you considered keeping two directories? One where the older data
> is moved to (by the first job, instead of replacing files), for
> consumption by the second job, which triggers by watching this
> directory?
>
> That is,
> MR Job #1 (the producer) moves existing data to /path/b/timestamp,
> and writes new data to /path/a.
> MR Job #2 (the consumer) uses the latest /path/b/timestamp (or the whole
> set of timestamps available under /path/b at that point) as its
> input, and deletes it afterwards. Hence #2 can monitor this
> directory to trigger itself.
>
> On Mon, Aug 13, 2012 at 4:22 PM, David Ginzburg <ginz...@hotmail.com> wrote:
> > Hi,
> >
> > I have an HDFS folder and an M/R job that periodically updates it by replacing
> > the data with newly generated data.
> >
> > I have a different M/R job that periodically or ad hoc processes the data in
> > the folder.
> >
> > The second job naturally fails sometimes, when the data is replaced by
> > newly generated data after the job plan, including the input paths, has already
> > been submitted.
> >
> > Is there an elegant solution?
> >
> > My current thought is to query the JobTracker for running jobs and go over
> > all the input files in each job's XML, so the swap can block until
> > the input path is no longer an input path of any currently running job.
>
> --
> Harsh J
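
P.S. To make the idea concrete, here is a rough, untested sketch of what I mean by mv(source, target, timeout), using only the FileSystem API. The "in use" check is deliberately left as a pluggable interface, since that is exactly the part I am unsure about (my current thought being the JobTracker/job-XML query from my original mail quoted above):

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockingMove {

    /** Caller-supplied check for "is this folder still being read?", e.g. one
     *  that asks the JobTracker whether any running job lists it as an input path. */
    public interface InUseCheck {
        boolean inUse(Path folder) throws IOException;
    }

    /**
     * Waits until source is reported as not in use (or the timeout expires),
     * then renames it to target. Returns true only if the rename was performed.
     */
    public static boolean mv(FileSystem fs, Path source, Path target,
                             long timeoutMillis, InUseCheck check)
            throws IOException, InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (check.inUse(source)) {
            if (System.currentTimeMillis() >= deadline) {
                return false;           // timed out while the folder was still in use
            }
            Thread.sleep(5000);         // poll every few seconds
        }
        return fs.rename(source, target);
    }
}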