Yoel Cabo Lopez created STORM-2219:
--------------------------------------
Summary: In HDFSBolt and SequenceFileBolt the files are overridden
if they already exist
Key: STORM-2219
URL: https://issues.apache.org/jira/browse/STORM-2219
Project: Apache Storm
Issue Type: Bug
Components: storm-hdfs
Reporter: Yoel Cabo Lopez
Priority: Critical
Both bolts open their files in create mode, so a file that already exists is
overwritten. If the bolt is restarted for any reason (a rebalance or a crash),
the data already written is lost. I think this is especially serious. What's
more, since the rotation number is stored only in memory, it starts over after
a restart and all the earlier files are eventually wiped out.
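
To make the failure mode concrete, here is a minimal sketch that reproduces it
on the local file system with java.nio rather than with the HDFS client. The
Writer class, the data-N.txt naming and the temp directory are assumptions made
up for this illustration; they are not the actual bolt code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class CreateModeDemo {

    // Hypothetical stand-in for a bolt's writer: the rotation counter
    // lives only in memory, so a new instance starts again at 0.
    static final class Writer {
        private final Path dir;
        private int rotation = 0;

        Writer(Path dir) {
            this.dir = dir;
        }

        Path currentFile() {
            return dir.resolve("data-" + rotation + ".txt");
        }

        // Create mode: an existing file with the same name is truncated.
        void write(String data) {
            try {
                Files.write(currentFile(), data.getBytes(StandardCharsets.UTF_8),
                        StandardOpenOption.CREATE,
                        StandardOpenOption.TRUNCATE_EXISTING,
                        StandardOpenOption.WRITE);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    // Writes once, simulates a restart with a fresh Writer, writes again,
    // and returns what is left in the first rotation file.
    static String fileZeroAfterRestart() {
        try {
            Path dir = Files.createTempDirectory("create-mode-demo");
            Writer first = new Writer(dir);
            first.write("tuples before restart");

            Writer restarted = new Writer(dir); // counter is back at 0
            restarted.write("tuples after restart");

            byte[] bytes = Files.readAllBytes(dir.resolve("data-0.txt"));
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Prints "tuples after restart": the earlier data is gone.
        System.out.println(fileZeroAfterRestart());
    }
}
```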
I think there are two possible approaches:
- If the file already exists, open it in append mode. I see some problems here:
(1) the tuple data written across the several rotations will not keep its
order unless we jump to the last rotation, (2) the TimedRotationPolicy and
others that rely on data stored in memory will not behave exactly as expected,
and (3) in the case of the SequenceFileBolt, if the existing file has a
different compression codec or type, an exception will be raised. Besides, we
should change the way the HDFSWriter tracks the write offset, because it
depends on the size of the tuples being written and not on the size of the
file (and that would affect the FileSizeRotationPolicy). This doesn't affect
the SequenceFileWriter, since it uses the getLength() method of
SequenceFile.Writer, which handles append mode properly.
- If the file exists, move to the next rotation. The problem I see is that if
the rotation number is not part of the file name, this will enter an endless
loop. Another issue is that if the restart of the bolt is caused by some
problem that is not fixed by the restart, it could keep creating new files
indefinitely until the NameNode collapses.
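
As a rough illustration of the two approaches above, the following sketch again
uses local files through java.nio instead of the HDFS client. The method names,
the nameFor function and the maxAttempts cap are hypothetical. The first method
takes the starting offset from the size of the existing file, so a size-based
rotation would see the real file size. The second bounds the search for a free
rotation, so a file name that ignores the rotation number fails fast instead of
looping forever.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.function.IntFunction;

public class ExistingFileStrategies {

    // Approach 1: append if the file exists. Returns the offset at which
    // this write started, taken from the file size and not from the
    // size of the data written by this process.
    static long appendAndReturnStartOffset(Path file, String data) {
        try {
            long startOffset = Files.exists(file) ? Files.size(file) : 0L;
            Files.write(file, data.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            return startOffset;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Approach 2: move to the first rotation whose file does not exist yet.
    // The search is capped at maxAttempts candidates.
    static int nextFreeRotation(Path dir, IntFunction<String> nameFor,
                                int start, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            int rotation = start + i;
            if (!Files.exists(dir.resolve(nameFor.apply(rotation)))) {
                return rotation;
            }
        }
        throw new IllegalStateException(
                "no free rotation after " + maxAttempts + " attempts");
    }

    static Path newTempDir() {
        try {
            return Files.createTempDirectory("existing-file-demo");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = newTempDir();
        Path file = dir.resolve("data-0.txt");
        System.out.println(appendAndReturnStartOffset(file, "abc")); // 0
        System.out.println(appendAndReturnStartOffset(file, "de"));  // 3
        // data-0.txt exists now, so the next free rotation is 1.
        System.out.println(
                nextFreeRotation(dir, r -> "data-" + r + ".txt", 0, 10)); // 1
    }
}
```

Note that the cap only limits one search; it does not by itself address the
second concern, a bolt that keeps restarting and creates a new file each time.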
I guess the solution will be a mix of both approaches, and I think I can
implement it. But first I would like to ask whether anyone has any other
concerns about it.
For the moment I have just written a bolt that satisfies my use case: Sequence
Files are opened in append mode if the file exists, and rotation is based on
size. But the real solution should be more general.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)