Bijith Kumar created FLUME-2445:
-----------------------------------

             Summary: Duplicate files in s3 (both temp and final file)
                 Key: FLUME-2445
                 URL: https://issues.apache.org/jira/browse/FLUME-2445
             Project: Flume
          Issue Type: Bug
          Components: Sinks+Sources
    Affects Versions: v1.5.0
            Reporter: Bijith Kumar


I noticed that the HDFS sink leaves both the temp file and the final file in 
the S3 bucket, as shown below:
-rw-rw-rw-   1    9558423 2014-08-18 18:01 s3n://my-bucket/flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz
-rw-rw-rw-   1    9558423 2014-08-18 18:01 s3n://my-bucket/flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz.tmp
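
To check whether other files in the bucket are affected in the same way, a 
small script along the following lines could be run over a key listing. This is 
only a sketch: the function name find_leftover_temp_files is hypothetical, the 
".tmp" suffix is assumed from the listing above, and the input is assumed to be 
a plain list of key strings obtained by whatever means is convenient.

```python
def find_leftover_temp_files(keys, temp_suffix=".tmp"):
    """Return the temp keys whose final (renamed) counterpart also exists.

    `keys` is any iterable of object keys or paths. The ".tmp" default is
    an assumption taken from the listing in this report.
    """
    present = set(keys)
    return sorted(
        key
        for key in present
        # A leftover is a temp key whose name minus the suffix is also present.
        if key.endswith(temp_suffix) and key[: -len(temp_suffix)] in present
    )


if __name__ == "__main__":
    # The two keys from the listing in this report.
    listing = [
        "flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz",
        "flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz.tmp",
    ]
    for key in find_leftover_temp_files(listing):
        print(key)
```

A temp key with no final counterpart is not reported, since that is the 
expected state of a file that is still being written.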

I could not find any errors in the agent log. However, when I restarted the 
agent the next day, it tried to close and rename the temp file again. Even 
after this second attempt, both files still exist. 
The logs are below. The file was uploaded on Aug 18 and the agent was restarted 
on Aug 19.

$ grep actions-i-e9b26de6.1408381201580 logs/flume.log 
18 Aug 2014 17:00:01,591 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.hdfs.BucketWriter.open:261)  - Creating s3n://my-bucket/flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz.tmp
18 Aug 2014 17:00:02,150 INFO  [hdfs-s3sink-actions-call-runner-1] (org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.<init>:182)  - OutputStream for key 'flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz.tmp' writing to tempfile '/var/lib/hadoop-hdfs/cache/ec2-user/s3/output-1521416101446161225.tmp'
18 Aug 2014 18:01:02,535 INFO  [hdfs-s3sink-actions-roll-timer-0] (org.apache.flume.sink.hdfs.BucketWriter$5.call:469)  - Closing idle bucketWriter s3n://my-bucket/flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz.tmp at 1408384862535
18 Aug 2014 18:01:02,535 INFO  [hdfs-s3sink-actions-roll-timer-0] (org.apache.flume.sink.hdfs.BucketWriter.close:409)  - Closing s3n://my-bucket/flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz.tmp
18 Aug 2014 18:01:02,535 INFO  [hdfs-s3sink-actions-call-runner-7] (org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.close:217)  - OutputStream for key 'flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz.tmp' closed. Now beginning upload
18 Aug 2014 18:01:08,043 INFO  [hdfs-s3sink-actions-call-runner-7] (org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.close:229)  - OutputStream for key 'flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz.tmp' upload complete
18 Aug 2014 18:01:08,165 INFO  [hdfs-s3sink-actions-call-runner-8] (org.apache.flume.sink.hdfs.BucketWriter$8.call:669)  - Renaming s3n://my-bucket/flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz.tmp to s3n://my-bucket/flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz

19 Aug 2014 19:55:37,635 INFO  [conf-file-poller-0] (org.apache.flume.sink.hdfs.BucketWriter.close:409)  - Closing s3n://my-bucket/flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz.tmp
19 Aug 2014 19:55:37,635 INFO  [conf-file-poller-0] (org.apache.flume.sink.hdfs.BucketWriter.close:428)  - HDFSWriter is already closed: s3n://my-bucket/flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz.tmp
19 Aug 2014 19:55:38,064 INFO  [hdfs-s3sink-actions-call-runner-1] (org.apache.flume.sink.hdfs.BucketWriter$8.call:669)  - Renaming s3n://my-bucket/flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz.tmp to s3n://my-bucket/flume/actions/day=16300/hour=17/actions-i-e9b26de6.1408381201580.json.gz



--
This message was sent by Atlassian JIRA
(v6.2#6252)
