Hey Jelez,

Based on your stack trace, it sounds like you're using S3 as an HDFS replacement for Hadoop. S3, by default, will allow you to overwrite a file, so your T2 shouldn't have an issue if it's using S3 directly:
http://stackoverflow.com/questions/9517198/can-i-update-an-existing-amazon-s3-object

However, given that you're interacting with S3 through Hadoop, it looks to me like it's Hadoop that's preventing you from overwriting. I am not terribly familiar with Sqoop, but perhaps it has an "overwrite" option? If not, then I can't really think of a way to handle this idempotently, unless you couple the two operations together in a bash script, as you described; I've put a rough sketch of such a wrapper at the bottom of this mail, below the quoted thread. Perhaps someone else has some ideas.

Cheers,
Chris

On Mon, May 16, 2016 at 8:25 AM, Raditchkov, Jelez (ETW) <[email protected]> wrote:

> Thank you Chris!
>
> I wanted to keep all tasks within the DAG so it is transparent and seems
> like "the right" way to do it. That is, I have separate tasks for cleanup
> and for executing Sqoop.
>
> If I understand your response correctly, I have to make a bash or Python
> wrapper script that deletes the S3 file and then runs Sqoop, i.e. combine
> T1 and T2. This seems hacky to me, in that those are different
> functionalities handled by different types of operators. By this logic I
> could just combine all my tasks into a single script and have a DAG with
> a single task.
>
> Please advise if I am getting something wrong.
>
>
> -----Original Message-----
> From: Chris Riccomini [mailto:[email protected]]
> Sent: Monday, May 16, 2016 7:43 AM
> To: [email protected]
> Subject: Re: Hadoop tasks - File Already Exists Exception
>
> Hey Jelez,
>
> The recommended way to handle this is to make your tasks idempotent. T2
> should overwrite the S3 file, not fail if it already exists.
>
> Cheers,
> Chris
>
> On Sun, May 15, 2016 at 11:42 AM, Raditchkov, Jelez (ETW) <
> [email protected]> wrote:
>
> > I am running several dependent tasks:
> >
> > T1 - delete the S3 folder for T2
> > T2 - Sqoop from the DB to the S3 folder
> >
> > The problem: if T2 fails in the middle, every retry then gets:
> >
> > Encountered IOException running import job:
> > org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
> > s3://...
> >
> > Is there a way to reattempt a group of tasks, not only T2? The way it
> > is now, the DAG fails because the S3 folder exists (it was created by
> > the failed T2 attempt), and the DAG can never succeed.
> >
> > Any suggestions?
> >
> > Thanks!
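
For what it's worth, here is the rough sketch I mentioned above of what a combined "cleanup + import" wrapper might look like. Everything in it is a placeholder (bucket path, JDBC URL, table name), and you'd want to double-check that "hadoop fs -rm -r -f" behaves as expected against your S3 filesystem before relying on it:

#!/usr/bin/env bash
# Rough sketch of a combined "cleanup + Sqoop import" wrapper, run as one task
# so that a retry always starts from a clean target directory.
# The bucket path, JDBC URL, and table name below are all placeholders.
set -euo pipefail

TARGET_DIR="s3://my-bucket/path/to/output"

# Remove any partial output left behind by a failed attempt.
# The -f flag keeps this from failing when the directory doesn't exist yet.
hadoop fs -rm -r -f "${TARGET_DIR}"

# Re-run the import into the now-empty location.
sqoop import \
  --connect "jdbc:mysql://db-host/mydb" \
  --table my_table \
  --target-dir "${TARGET_DIR}"

I believe newer Sqoop releases also have a --delete-target-dir option on "sqoop import" that does the cleanup for you; if your version supports it, that would let you keep T2 as a plain Sqoop call and drop the separate delete step entirely. Either way, the retry becomes idempotent because the delete and the import succeed or fail as a single unit.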
