Thank you, Chris!

I wanted to keep all tasks within the DAG so that it stays transparent, which also seems 
like "the right" way to do it. That is, I have one task for the cleanup and a separate 
one for executing sqoop.
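
To be concrete, the current DAG looks roughly like this (simplified sketch; the paths, 
connection string and schedule are placeholders):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG('db_to_s3', start_date=datetime(2016, 5, 1), schedule_interval='@daily')

    # T1: clean up the S3 target folder before the import
    t1_cleanup = BashOperator(
        task_id='delete_s3_folder',
        bash_command='aws s3 rm --recursive s3://my-bucket/sqoop/output/',
        dag=dag)

    # T2: sqoop the table from the DB into that same folder
    t2_sqoop = BashOperator(
        task_id='sqoop_import',
        bash_command='sqoop import --connect jdbc:... --table my_table '
                     '--target-dir s3://my-bucket/sqoop/output/',
        dag=dag)

    t1_cleanup.set_downstream(t2_sqoop)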

If I understand your response correctly, I have to make a bash or python wrapper script 
that deletes the S3 file and then runs sqoop, i.e. combine T1 and T2. This seems hacky 
to me, because those are different functionalities handled by different types of 
operators. By that logic I could combine all my tasks into a single script and have a 
DAG with a single task.
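
In other words, I would end up with a single combined task roughly like this (untested 
sketch; the bucket, prefix and sqoop arguments are just illustrations):

    import subprocess
    import boto3
    from airflow.operators.python_operator import PythonOperator

    def cleanup_and_sqoop():
        # what T1 did: clear the old output prefix
        s3 = boto3.resource('s3')
        s3.Bucket('my-bucket').objects.filter(Prefix='sqoop/output/').delete()
        # what T2 did: run the import into the now-empty folder
        subprocess.check_call([
            'sqoop', 'import',
            '--connect', 'jdbc:...',   # placeholder
            '--table', 'my_table',     # placeholder
            '--target-dir', 's3://my-bucket/sqoop/output/'])

    t1_t2 = PythonOperator(
        task_id='cleanup_and_sqoop',
        python_callable=cleanup_and_sqoop,
        dag=dag)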

Please advise if I am getting something wrong.



-----Original Message-----
From: Chris Riccomini [mailto:[email protected]] 
Sent: Monday, May 16, 2016 7:43 AM
To: [email protected]
Subject: Re: Hadoop tasks - File Already Exists Exception

Hey Jelez,

The recommended way to handle this is to make your tasks idempotent. T2 should 
overwrite the S3 file, not fail if it already exists.
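
For example, something like this (rough sketch; the connection string and paths are 
placeholders, and assuming sqoop's --delete-target-dir works against your S3 filesystem):

    t2_sqoop = BashOperator(
        task_id='sqoop_import',
        # --delete-target-dir clears a leftover target from a failed attempt
        # instead of dying with FileAlreadyExistsException on retry
        bash_command='sqoop import --connect jdbc:... --table my_table '
                     '--delete-target-dir --target-dir s3://my-bucket/sqoop/output/',
        dag=dag)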

Cheers,
Chris

On Sun, May 15, 2016 at 11:42 AM, Raditchkov, Jelez (ETW) < 
[email protected]> wrote:

> I am running several dependent tasks:
> T1 - delete the S3 folder for T2
> T2 - sqoop from the DB to the S3 folder
>
> The problem: if T2 fails in the middle, every retry then gets: Encountered 
> IOException running import job:
> org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory 
> s3://...
>
> Is there a way to reattempt a group of tasks, not only T2? The way it is 
> now, the DAG fails because the S3 folder exists (it was created by the 
> failed T2 attempt), so the DAG can never succeed.
>
> Any suggestions?
>
> Thanks!
>
>
