Sverre Bakke created FLUME-2458:
-----------------------------------

             Summary: Separate hdfs tmp directory for flume hdfs sink
                 Key: FLUME-2458
                 URL: https://issues.apache.org/jira/browse/FLUME-2458
             Project: Flume
          Issue Type: Improvement
          Components: Sinks+Sources
    Affects Versions: v1.5.0.1
            Reporter: Sverre Bakke
            Priority: Minor


The current HDFS sink writes temporary files to the same directory in which 
the final file is stored. This is a problem for several reasons:

1) File moving
When a MapReduce job fetches the list of files to process and some of those 
files are gone by the time it reads them (i.e. they have been renamed from 
.tmp to whatever final name they are supposed to have), the job will crash.

2) File type
MapReduce decides how to process a file by looking at its file extension. 
If the file is compressed, it will decompress it for you. If the file still 
has a .tmp extension (in the same folder), a compressed file is treated as 
an uncompressed file, thus breaking the results of the MapReduce job.
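
To illustrate problem 2, here is a simplified model of extension-based codec 
selection. The lookup table and the codec_for function are hypothetical and 
are not the Hadoop API; they only show why a trailing .tmp hides the 
compression extension.

```python
import os

# Hypothetical suffix-to-codec table, for illustration only.
CODEC_BY_SUFFIX = {".gz": "gzip", ".bz2": "bzip2"}


def codec_for(filename):
    """Pick a codec from the last extension of the name, or None if unknown."""
    _, ext = os.path.splitext(filename)
    return CODEC_BY_SUFFIX.get(ext)


# The finished file is recognised as gzip.
print(codec_for("FlumeData.1407000000000.gz"))      # gzip
# The in-progress file ends in .tmp, so no codec is found and the
# compressed bytes would be read as if they were plain data.
print(codec_for("FlumeData.1407000000000.gz.tmp"))  # None
```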

I propose that the sink get an optional tmp path in which to store these 
temporary files, so that both issues are avoided.
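
A minimal sketch of the proposed behaviour, using the local filesystem as a 
stand-in for HDFS. The function name write_via_tmp_dir, the directory names 
and the file name are assumptions made for illustration; this is not how the 
sink is implemented.

```python
import os
import tempfile


def write_via_tmp_dir(tmp_dir, final_dir, name, data):
    """Write to tmp_dir first, then move the finished file into final_dir."""
    tmp_path = os.path.join(tmp_dir, name + ".tmp")
    final_path = os.path.join(final_dir, name)
    with open(tmp_path, "wb") as f:
        f.write(data)
    # The file only appears in final_dir once it is complete and already
    # carries its final name, so a directory listing never sees a .tmp file.
    os.rename(tmp_path, final_path)
    return final_path


root = tempfile.mkdtemp()
tmp_dir = os.path.join(root, "tmp")
final_dir = os.path.join(root, "final")
os.makedirs(tmp_dir)
os.makedirs(final_dir)

write_via_tmp_dir(tmp_dir, final_dir, "FlumeData.1", b"payload")
print(os.listdir(final_dir))  # ['FlumeData.1']
print(os.listdir(tmp_dir))    # []
```

A job that lists only the final directory would then see neither files that 
are later renamed nor files with a .tmp extension.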



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
