David,

While ZK can solve this, locking may only slow you down. Let's try to
keep it simple?

Have you considered keeping two directories? One that the first job
moves the older data into (instead of replacing files in place), for
consumption by the second job, which triggers itself by watching this
directory?

That is:
MR Job #1 (the producer) moves the existing data to /path/b/timestamp,
and writes its new data to /path/a.
MR Job #2 (the consumer) uses the latest /path/b/timestamp (or the
whole set of timestamps available under /path/b at that point) as its
input, and deletes it afterwards. Hence, job #2 can monitor this
directory to trigger itself.
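
Roughly, with the FileSystem API, something like the below (just a
sketch; the /path names, class name, and helper names are illustrative,
and error handling is elided):

import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirRotation {

  private static final Path LIVE = new Path("/path/a");    // producer output
  private static final Path ARCHIVE = new Path("/path/b"); // consumer input

  // Producer side (job #1): before writing fresh data, rename the
  // previous output into a timestamped directory instead of replacing it.
  static void rotateOut(FileSystem fs) throws IOException {
    if (fs.exists(LIVE)) {
      fs.mkdirs(ARCHIVE);
      fs.rename(LIVE,
          new Path(ARCHIVE, String.valueOf(System.currentTimeMillis())));
    }
    // ... then run job #1 with /path/a as its output path ...
  }

  // Consumer side (job #2): pick the latest timestamped directory (or
  // list them all), use it as the job input, delete it when done.
  // Millisecond timestamps are equal-length strings, so a plain
  // lexicographic sort orders them chronologically.
  static Path pickLatest(FileSystem fs) throws IOException {
    if (!fs.exists(ARCHIVE)) return null;
    FileStatus[] snaps = fs.listStatus(ARCHIVE);
    if (snaps.length == 0) return null; // nothing to consume yet
    Arrays.sort(snaps, new Comparator<FileStatus>() {
      public int compare(FileStatus a, FileStatus b) {
        return a.getPath().getName().compareTo(b.getPath().getName());
      }
    });
    return snaps[snaps.length - 1].getPath();
  }

  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    rotateOut(fs);
    Path input = pickLatest(fs);
    if (input != null) {
      // ... run job #2 with 'input' via FileInputFormat.addInputPath ...
      fs.delete(input, true); // recursive delete after a successful run
    }
  }
}

Since rename is a metadata-only, atomic operation in HDFS, the consumer
never sees a half-moved snapshot under /path/b.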

On Mon, Aug 13, 2012 at 4:22 PM, David Ginzburg <ginz...@hotmail.com> wrote:
> Hi,
>
> I have an HDFS folder and M/R job that periodically updates it by replacing
> the data with newly generated data.
>
> I have a different M/R job that periodically or ad-hoc process the data in
> the folder.
>
> The second job, naturally, sometimes fails when the data is replaced
> by newly generated data after the job plan, including the input paths,
> has already been submitted.
>
> Is there an elegant solution?
>
> My current thought is to query the JobTracker for running jobs and go
> over all the input files in each job's XML, so that the swap can block
> until the input path is no longer among any currently executing job's
> input paths.



-- 
Harsh J
