Robert Joseph Evans created STORM-837:
-----------------------------------------

             Summary: HdfsState ignores commits
                 Key: STORM-837
                 URL: https://issues.apache.org/jira/browse/STORM-837
             Project: Apache Storm
          Issue Type: Bug
            Reporter: Robert Joseph Evans
            Priority: Critical


HdfsState works with Trident, which is supposed to provide exactly-once 
processing.  Trident does this in two ways: first by informing the state about 
commits so it can be sure the data is written out, and second by supplying a 
commit id so that double commits can be handled.

HdfsState ignores the beginCommit and commit calls and, with them, the commit 
ids.  This means that if you use HdfsState and your worker crashes, you may both 
lose data and get some data twice.

At a minimum, the flush and file rotation should be tied to the commit in some 
way.  The commit ID should also be written out with the data so that someone 
reading the data has a hope of deduping it themselves.
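
As a rough illustration of the idea, here is a minimal, self-contained sketch 
of a sink that buffers writes between beginCommit and commit, makes them 
durable only on commit, tags each record with the commit id, and skips batches 
whose id has already been committed.  It does not use the Storm or HDFS APIs: 
the class CommitAwareSink, its write and durableRecords methods, and the 
in-memory lists standing in for HDFS files are all hypothetical.  It also 
assumes that commit ids only increase and that a replayed batch reuses its 
original id.  A real fix would additionally have to persist the last committed 
id durably, which this sketch does not do.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a commit-aware state; not the real HdfsState.
public class CommitAwareSink {
    // Stands in for data that has been flushed to HDFS.
    private final List<String> durable = new ArrayList<>();
    // Records written since the last beginCommit, not yet durable.
    private final List<String> pending = new ArrayList<>();
    // Assumption: kept in memory here; a real state must persist this.
    private long lastCommittedTxid = -1;
    private Long currentTxid = null;
    private boolean replay = false;

    public void beginCommit(Long txid) {
        // Drop anything left over from a batch that never committed.
        pending.clear();
        currentTxid = txid;
        // A txid at or below the last committed one is a replayed batch.
        replay = txid <= lastCommittedTxid;
    }

    public void write(String record) {
        if (currentTxid == null) {
            throw new IllegalStateException("write called outside a batch");
        }
        if (!replay) {
            // Prefix the commit id so readers can dedupe on their own.
            pending.add(currentTxid + "\t" + record);
        }
    }

    public void commit(Long txid) {
        if (!replay) {
            // The "flush" is tied to the commit: data becomes durable here.
            durable.addAll(pending);
            lastCommittedTxid = txid;
        }
        pending.clear();
        currentTxid = null;
    }

    public List<String> durableRecords() {
        return new ArrayList<>(durable);
    }
}
```

With this shape, a batch that is replayed after a successful commit is 
ignored, and a batch that was interrupted before commit leaves nothing behind, 
so its replay is written exactly once.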

Also, with the rotationActions it is possible for a partially written file to 
be leaked and never moved to its final location, because it is never rotated.  
I personally think the actions are too generic for this case and need to be 
deprecated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
