Yoel Cabo Lopez created STORM-2219:
--------------------------------------

             Summary: In HDFSBolt and SequenceFileBolt the files are overridden 
if they already exist
                 Key: STORM-2219
                 URL: https://issues.apache.org/jira/browse/STORM-2219
             Project: Apache Storm
          Issue Type: Bug
          Components: storm-hdfs
            Reporter: Yoel Cabo Lopez
            Priority: Critical


In both bolts the files are opened in create mode, which means that an existing 
file is overwritten. So if the bolt is restarted for any reason (a rebalance or 
a crash), the data already written is lost. I think this is especially serious. 
What's more, since the rotation number is stored only in memory, it starts over 
after a restart, so all the existing files will eventually be wiped out.
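
To make the problem concrete, here is a small sketch that uses local files as a 
stand-in for HDFS. It is not the bolt code: OverwriteDemo and its methods are 
made-up names, and I am assuming the HDFS side behaves analogously, with 
FileSystem.create(path) overwriting by default and FileSystem.append(path) 
keeping the existing content.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class; local files stand in for HDFS files.
class OverwriteDemo {

    // Writes twice to the same path, as a bolt would before and after a
    // restart, and returns what the file contains afterwards.
    static String writeTwice(boolean appendIfExists) {
        try {
            Path file = Files.createTempFile("bolt-output", ".txt");
            try {
                write(file, "before-restart\n", appendIfExists);
                write(file, "after-restart\n", appendIfExists);
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static void write(Path file, String data, boolean appendIfExists) throws IOException {
        StandardOpenOption mode = appendIfExists
                ? StandardOpenOption.APPEND              // keep what is already there
                : StandardOpenOption.TRUNCATE_EXISTING;  // "create mode": start from empty
        Files.write(file, data.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, mode);
    }
}
```

Calling OverwriteDemo.writeTwice(false) leaves only the data written after the 
"restart", while OverwriteDemo.writeTwice(true) keeps both writes.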

I think there are two possible approaches:
- If the file already exists, open it in append mode. I see some problems here: 
(1) the tuple data written across the several rotations will not keep its order 
unless we jump to the last rotation, (2) the TimedRotationPolicy and other 
policies that rely on in-memory state will not behave exactly as expected, and 
(3) in the case of the SequenceFileBolt, if the existing file has a different 
compression codec or type, an exception will be raised. Besides, we would have 
to change the way the HDFSWriter tracks the write offset, because it depends on 
the size of the tuples written and not on the size of the file (and that would 
affect the FileSizeRotationPolicy). This doesn't affect the SequenceFileWriter, 
since it uses the getLength() method of SequenceFile.Writer, which handles 
append mode properly.
- If the file exists, move to the next rotation. The problem I see is that if 
the rotation number is not part of the file name, this will enter an endless 
loop. Another issue is that if the restart of the bolt is caused by some 
problem that is not fixed after the restart, it could keep creating new files 
indefinitely until the NameNode collapses.
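
A rough sketch of how the two approaches could be combined follows. It is only 
an illustration under my own assumptions, not existing Storm code: 
RotationProbe, Decision, chooseTarget and maxProbes are hypothetical names, and 
the file-name format and existence check are passed in as functions so the 
logic can be tried without a cluster. In a real bolt the existence check would 
presumably be FileSystem.exists(path). The idea is to skip to the next free 
rotation, but fall back to append mode when the name does not change with the 
rotation number or when too many files already exist.

```java
import java.util.function.IntFunction;
import java.util.function.Predicate;

// Hypothetical helper, not part of storm-hdfs.
class RotationProbe {

    enum Mode { CREATE, APPEND }

    // Result of the probe: which file to open and how.
    static final class Decision {
        final String fileName;
        final Mode mode;

        Decision(String fileName, Mode mode) {
            this.fileName = fileName;
            this.mode = mode;
        }
    }

    static Decision chooseTarget(IntFunction<String> nameFor,
                                 Predicate<String> exists,
                                 int startRotation,
                                 int maxProbes) {
        if (maxProbes < 1) {
            throw new IllegalArgumentException("maxProbes must be at least 1");
        }
        String previous = null;
        for (int i = 0; i < maxProbes; i++) {
            String candidate = nameFor.apply(startRotation + i);
            if (candidate.equals(previous)) {
                // The name does not depend on the rotation number, so probing
                // further would loop forever: append to the existing file.
                return new Decision(candidate, Mode.APPEND);
            }
            if (!exists.test(candidate)) {
                // Free slot found: safe to open in create mode.
                return new Decision(candidate, Mode.CREATE);
            }
            previous = candidate;
        }
        // Probe budget exhausted: append to the last file checked instead of
        // creating yet another file.
        return new Decision(previous, Mode.APPEND);
    }
}
```

For example, with files data-0.seq and data-1.seq already present, 
chooseTarget(i -> "data-" + i + ".seq", existing::contains, 0, 10) picks 
data-2.seq in create mode, while a format that always yields data.seq results 
in append mode on that file. The cap only bounds the probing within a single 
start of the bolt. It does not by itself address the ordering and rotation 
policy problems listed for the append approach.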

I guess the solution will be a mix of both approaches, and I think I can 
implement it. But first I would like to ask whether anyone has any other 
concerns about it.

For now I have just written a bolt that satisfies my use case: sequence files 
are opened in append mode if the file exists, and rotation is based on size. 
But the solution should be more general.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
