Jelez-

A point: S3 uses eventual consistency. This means that if you delete an S3
object (directories don't really exist) and then list it again, it might
still be there, because the S3 node that handled the second request has not
yet learned about the first one.
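
As a rough illustration (a boto3 sketch of my own, not from your job; the
bucket and key names are made up), a delete followed immediately by an
existence check can still see the object, so you may need to poll:

    import time
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    s3.delete_object(Bucket="my-bucket", Key="exports/my_table/part-00000")

    # Right after the delete, another S3 endpoint may still report the
    # object. Poll until the deletion has propagated (or give up).
    for _ in range(10):
        try:
            s3.head_object(Bucket="my-bucket",
                           Key="exports/my_table/part-00000")
            time.sleep(1)   # still visible, wait and re-check
        except ClientError as e:
            if e.response["Error"]["Code"] == "404":
                break       # the delete is now visible here
            raise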

A second point: in data management you want everything to be repeatable. By
reusing one S3 key for different data over time, you lose the ability to
re-run old tasks. That is a bad idea. I would add the date/time to the S3
key somewhere and add a separate DAG to clear out old S3 data.
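
For example, a minimal sketch (assuming Airflow's Jinja templating; the
connection string, table, and bucket names are placeholders, and the Sqoop
flags are only illustrative):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG("sqoop_to_s3", start_date=datetime(2016, 5, 1),
              schedule_interval="@daily")

    # {{ ds }} expands to the execution date, so every run writes to its
    # own S3 prefix and old runs stay re-runnable.
    sqoop_import = BashOperator(
        task_id="sqoop_import",
        bash_command=(
            "sqoop import --connect jdbc:mysql://dbhost/mydb "
            "--table my_table "
            "--target-dir s3://my-bucket/exports/my_table/{{ ds }}/"
        ),
        dag=dag,
    )

A separate daily DAG can then delete prefixes older than whatever retention
you need.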

Cheers,

Lance

On Mon, May 16, 2016 at 8:43 AM, Chris Riccomini <[email protected]>
wrote:

> Hey Jelez,
>
> Based on your stack trace, it sounds like you're using S3 as an HDFS
> replacement for Hadoop. S3, by default, will allow you to overwrite a
> file--your T2 shouldn't have an issue if it's using S3 directly:
>
>
> http://stackoverflow.com/questions/9517198/can-i-update-an-existing-amazon-s3-object
>
> However, given that you're interacting with S3 through Hadoop, it looks to
> me like it's Hadoop that's preventing you from overwriting.
>
> I am not terribly familiar with Sqoop, but perhaps it has an "overwrite"
> option? If not, then I can't really think of a way to handle this that's
> idempotent, unless you couple the two operations together in a bash script,
> as you described. Perhaps someone else has some ideas.
>
> Cheers,
> Chris
>
> On Mon, May 16, 2016 at 8:25 AM, Raditchkov, Jelez (ETW) <
> [email protected]> wrote:
>
> > Thank you Chris!
> >
> > I wanted to keep all tasks within the DAG so it stays transparent, which
> > seems like "the right" way to do it. That is, I have a separate task for
> > the cleanup and a separate one for executing Sqoop.
> >
> > If I understand your response correctly, I have to make a bash or python
> > wrapper script that deletes the S3 file and then runs Sqoop, i.e. combine
> > T1 and T2. This seems hacky to me, since those are different pieces of
> > functionality handled by different types of operators. By that logic I
> > could just combine all my tasks into a single script and have a DAG with
> > a single task.
> >
> > Please advise if I am getting something wrong.
> >
> >
> >
> > -----Original Message-----
> > From: Chris Riccomini [mailto:[email protected]]
> > Sent: Monday, May 16, 2016 7:43 AM
> > To: [email protected]
> > Subject: Re: Hadoop tasks - File Already Exists Exception
> >
> > Hey Jelez,
> >
> > The recommended way to handle this is to make your tasks idempotent. T2
> > should overwrite the S3 file, not fail if it already exists.
> >
> > Cheers,
> > Chris
> >
> > On Sun, May 15, 2016 at 11:42 AM, Raditchkov, Jelez (ETW) <
> > [email protected]> wrote:
> >
> > > I am running several dependent tasks:
> > > T1 - delete the S3 folder (the Sqoop target)
> > > T2 - sqoop from the DB into that S3 folder
> > >
> > > The problem: if T2 fails in the middle, every retry then gets: Encountered
> > > IOException running import job:
> > > org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
> > > s3://...
> > >
> > > Is there a way to reattempt a group of tasks, not only T2? The way it
> > > is now, the DAG fails because the S3 folder exists (it was created by
> > > the failed T2 attempt), so the DAG can never succeed.
> > >
> > > Any suggestions?
> > >
> > > Thanks!
> > >
> > >
> >
>



-- 
Lance Norskog
[email protected]
Redwood City, CA
