Thank you, Chris! I wanted to keep all tasks within the DAG so it stays transparent, which seems like "the right" way to do it. That is, I have one task for cleanup and a separate one for executing Sqoop.
If I understand your response correctly, I have to make a bash or Python wrapper script that deletes the S3 file and then runs Sqoop, i.e. combine T1 and T2. This seems hacky to me, since those are different functionalities handled by different types of operators. By this logic I could just combine all my tasks into a single script and have a DAG with a single task. Please advise if I am getting something wrong.

-----Original Message-----
From: Chris Riccomini [mailto:[email protected]]
Sent: Monday, May 16, 2016 7:43 AM
To: [email protected]
Subject: Re: Hadoop tasks - File Already Exists Exception

Hey Jelez,

The recommended way to handle this is to make your tasks idempotent. T2 should overwrite the S3 file, not fail if it already exists.

Cheers,
Chris

On Sun, May 15, 2016 at 11:42 AM, Raditchkov, Jelez (ETW) <[email protected]> wrote:
> I am running several dependent tasks:
> T1 - delete the S3 folder for T2
> T2 - sqoop from DB to the S3 folder
>
> The problem: if T2 fails in the middle, every retry then gets:
> Encountered IOException running import job:
> org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
> s3://...
>
> Is there a way to reattempt a group of tasks, not only T2? The way it
> is now, the DAG fails because the S3 folder exists when it was created
> by the failed T2 attempt, and the DAG can never succeed.
>
> Any suggestions?
>
> Thanks!
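[Editor's note: a minimal sketch of the idempotent-T2 idea discussed above, assuming an Airflow 1.7-era BashOperator and that the AWS CLI and sqoop are available on the worker. The DAG id, bucket, prefix, JDBC string, and table name are placeholders, not values from the thread. The point is that T2 cleans up its own output before importing, so retrying T2 alone succeeds and a separate cleanup task is no longer needed for correctness.]

    # Sketch only: names and connection details are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator  # import path may vary by Airflow version

    dag = DAG(
        dag_id='db_to_s3',
        start_date=datetime(2016, 5, 1),
        schedule_interval='@daily',
    )

    # T2 removes its own target prefix and then runs the import, so a retry
    # of this single task starts clean and FileAlreadyExistsException cannot occur.
    sqoop_import = BashOperator(
        task_id='sqoop_db_to_s3',
        bash_command=(
            'aws s3 rm s3://my-bucket/my-prefix/ --recursive && '
            'sqoop import '
            '--connect jdbc:mysql://db-host/mydb '
            '--table my_table '
            '--target-dir s3://my-bucket/my-prefix/'
        ),
        retries=3,
        dag=dag,
    )

[If your Sqoop release supports it, the import's --delete-target-dir option can perform the same cleanup inside Sqoop itself, which avoids the extra aws s3 rm step.]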
