[
https://issues.apache.org/jira/browse/FLUME-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110083#comment-14110083
]
Bijith Kumar commented on FLUME-2445:
-------------------------------------
Hi Everyone,
I spent considerable time testing this. In short, there are two key issues:
1. S3 uploads fail sporadically with the HDFS sink (multiple failures observed).
2. There are no retries if the sink to S3 fails for any reason.
#2 makes this a blocker for anyone using Flume to sink to S3, since an upload
to S3 can fail at any time.
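To make issue #2 concrete, here is a minimal sketch of the kind of retry that appears to be missing, assuming the upload can be wrapped in a Callable. RetryingUploader and callWithRetry are hypothetical names for illustration only; they are not part of Flume or Hadoop, and this is not how the HDFS sink is implemented.

```java
import java.util.concurrent.Callable;

// Hypothetical helper (not a Flume or Hadoop class): retries an action
// with exponential backoff between attempts.
final class RetryingUploader {
    private RetryingUploader() {}

    // Runs the action up to maxAttempts times, doubling the wait after
    // each failure. Rethrows the last failure if every attempt fails.
    static <T> T callWithRetry(Callable<T> action, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not retry when the thread is being shut down.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2;
                }
            }
        }
        throw last;
    }
}
```

A caller would pass the upload as a lambda, for example callWithRetry(() -> upload(file), 5, 1000L), where upload stands for whatever S3 client call is in use. Retrying is only safe if the upload is idempotent, which a put of a whole object to a fixed key should be.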
I would like to contribute a fix if the effort is reasonable. Can anyone guide me?
Meanwhile, I am going to use the File Roll sink and a custom S3 uploader :(
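For the workaround, a rough sketch of how the uploader could pick files from the File Roll sink's output directory. It assumes the most recently modified file may still be open for writing and therefore skips it. RolledFileScanner and filesReadyForUpload are hypothetical names, and the upload step itself is left out.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not part of Flume): lists files in a local
// directory that look finished and can be handed to an uploader.
final class RolledFileScanner {
    private RolledFileScanner() {}

    // Returns all regular files in dir, oldest first, except the most
    // recently modified one, which is assumed to still be in use.
    static List<Path> filesReadyForUpload(Path dir) throws IOException {
        List<Path> files;
        try (Stream<Path> stream = Files.list(dir)) {
            files = stream
                    .filter(Files::isRegularFile)
                    .sorted(Comparator.comparingLong((Path p) -> p.toFile().lastModified()))
                    .collect(Collectors.toList());
        }
        int readyCount = Math.max(0, files.size() - 1);
        return new ArrayList<>(files.subList(0, readyCount));
    }
}
```

Each returned file would be uploaded and then deleted or moved only after the upload succeeds, so a failed upload is simply picked up again on the next scan.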
> Duplicate files in s3 (both temp and final file)
> ------------------------------------------------
>
> Key: FLUME-2445
> URL: https://issues.apache.org/jira/browse/FLUME-2445
> Project: Flume
> Issue Type: Bug
> Components: Sinks+Sources
> Affects Versions: v1.5.0
> Reporter: Bijith Kumar
>
> Noticed that both the temp file and the final file are created in the S3 bucket
> by the HDFS sink, as shown below:
> -rw-rw-rw- 1 9558423 2014-08-18 18:01
> s3n://my-bucket/flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz
> -rw-rw-rw- 1 9558423 2014-08-18 18:01
> s3n://my-bucket/flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz.tmp
> I could not find any errors in the agent log. However, the agent tried to close
> and rename the temp file again when I restarted the agent the next day.
> Even after the second try, both files exist.
> Please find the logs below. The file was uploaded on Aug 18 and the agent was
> restarted on the 19th.
> $ grep actions-i-e9b26de6.1408381201580 logs/flume.log
> 18 Aug 2014 17:00:01,591 INFO
> [SinkRunner-PollingRunner-DefaultSinkProcessor]
> (org.apache.flume.sink.hdfs.BucketWriter.open:261) - Creating
> s3n://my-bucket/flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz.tmp
> 18 Aug 2014 17:00:02,150 INFO [hdfs-s3sink-actions-call-runner-1]
> (org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.<init>:182)
> - OutputStream for key
> 'flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz.tmp'
> writing to tempfile
> '/var/lib/hadoop-hdfs/cache/ec2-user/s3/output-1521416101446161225.tmp'
> 18 Aug 2014 18:01:02,535 INFO [hdfs-s3sink-actions-roll-timer-0]
> (org.apache.flume.sink.hdfs.BucketWriter$5.call:469) - Closing idle
> bucketWriter
> s3n://my-bucket/flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz.tmp
> at 1408384862535
> 18 Aug 2014 18:01:02,535 INFO [hdfs-s3sink-actions-roll-timer-0]
> (org.apache.flume.sink.hdfs.BucketWriter.close:409) - Closing
> s3n://my-bucket/flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz.tmp
> 18 Aug 2014 18:01:02,535 INFO [hdfs-s3sink-actions-call-runner-7]
> (org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.close:217)
> - OutputStream for key
> 'flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz.tmp'
> closed. Now beginning upload
> 18 Aug 2014 18:01:08,043 INFO [hdfs-s3sink-actions-call-runner-7]
> (org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.close:229)
> - OutputStream for key
> 'flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz.tmp'
> upload complete
> 18 Aug 2014 18:01:08,165 INFO [hdfs-s3sink-actions-call-runner-8]
> (org.apache.flume.sink.hdfs.BucketWriter$8.call:669) - Renaming
> s3n://my-bucket/flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz.tmp
> to
> s3n://my-bucket/flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz
> 19 Aug 2014 19:55:37,635 INFO [conf-file-poller-0]
> (org.apache.flume.sink.hdfs.BucketWriter.close:409) - Closing
> s3n://my-bucket/flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz.tmp
> 19 Aug 2014 19:55:37,635 INFO [conf-file-poller-0]
> (org.apache.flume.sink.hdfs.BucketWriter.close:428) - HDFSWriter is already
> closed:
> s3n://my-bucket/flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz.tmp
> 19 Aug 2014 19:55:38,064 INFO [hdfs-s3sink-actions-call-runner-1]
> (org.apache.flume.sink.hdfs.BucketWriter$8.call:669) - Renaming
> s3n://my-bucket/flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz.tmp
> to
> s3n://my-bucket/flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz
--
This message was sent by Atlassian JIRA
(v6.2#6252)