[ 
https://issues.apache.org/jira/browse/STORM-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated STORM-837:
--------------------------------------
    Comment: was deleted

(was: HdfsState is not a MapState.  It is just a State; there are no get 
operations supported on it, it is a sink that writes all input to HDFS.  The 
issue is with others reading the data.  The readers in this case are likely 
to be a batch job using a Hadoop input format to read the data.  For regular 
Storm it provides at-most-once or at-least-once semantics, and the HdfsBolt, 
which is also a sink, provides the exact same semantics, so if there is 
duplicate or lost data it could be for more reasons than just the bolt not 
syncing things correctly.  In that case, however, I would like to see the 
spout not ack a tuple until it has been synced to disk; that way we can 
truly be sure no data is lost, but that is another issue.

For Trident we expect exactly-once semantics, especially from something that 
comes as an official part of Storm.  The file formats that the data is 
written out in are just a log, with no ability to overwrite out-of-date data 
the way a MapState can.  They also have no knowledge of ZooKeeper or of 
which batch ids have or have not been fully committed.  And even if they did 
have that knowledge, the entries would still need a commit ID in them to let 
the reader know which ones it should ignore while reading.)
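
For reference, Trident drives every State through a two-phase protocol: it 
calls beginCommit(txid) before handing a batch to the state and commit(txid) 
once the batch is complete, so exactly-once depends on the state doing 
something meaningful with the txid.  Below is a minimal sketch of a sink 
that honors those hooks; HdfsTxnSink, its record framing, and its 
sync-on-commit behavior are hypothetical illustrations, not the real 
HdfsState (whose beginCommit/commit are the empty methods this ticket is 
about):

{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;

import storm.trident.state.State;

// Hypothetical commit-aware HDFS sink.  Trident calls beginCommit(txid)
// before a batch and commit(txid) after it; this sketch uses those hooks
// to tag every record with its txid and to sync at commit time.
public class HdfsTxnSink implements State {
    private final FSDataOutputStream out; // assume opened by a factory
    private Long currentTxid;             // txid of the in-flight batch

    public HdfsTxnSink(FSDataOutputStream out) {
        this.out = out;
    }

    @Override
    public void beginCommit(Long txid) {
        currentTxid = txid; // remember which batch the next writes belong to
    }

    // Called by a matching state updater for every tuple in the batch.
    public void write(byte[] record) throws IOException {
        // Prefix each entry with the commit id so a reader can recognize
        // and drop entries from a txid that was replayed after a crash.
        out.writeLong(currentTxid);
        out.writeInt(record.length);
        out.write(record);
    }

    @Override
    public void commit(Long txid) {
        try {
            // Make the batch durable before Trident records it as done.
            out.hsync();
        } catch (IOException e) {
            // Failing here makes Trident replay the batch with the same
            // txid, which the inline ids make safe to dedup on read.
            throw new RuntimeException(e);
        }
    }
}
{code}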

> HdfsState ignores commits
> -------------------------
>
>                 Key: STORM-837
>                 URL: https://issues.apache.org/jira/browse/STORM-837
>             Project: Apache Storm
>          Issue Type: Bug
>            Reporter: Robert Joseph Evans
>            Priority: Critical
>
> HdfsState works with Trident, which is supposed to provide exactly-once 
> processing.  Trident does this in two ways: first by informing the state 
> about commits, so it can be sure the data is written out, and second by 
> supplying a commit id, so that double commits can be handled.
> HdfsState ignores the beginCommit and commit calls, and with them the ids.  
> This means that if you use HdfsState and your worker crashes you may both 
> lose data and get some data twice.
> At a minimum the flush and file rotation should be tied to the commit in 
> some way, and the commit ID should be written out with the data so someone 
> reading the data has a hope of deduping it themselves.
> Also, with the rotationActions it is possible for a partially written file 
> to be leaked and never moved to its final location, because it is never 
> rotated.  I personally think the actions are too generic for this case and 
> need to be deprecated.
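
One way to make "tied to the commit" concrete is to rotate only on commit 
boundaries and to put the covered txid range in the file name.  The sketch 
below is a hypothetical illustration under those assumptions; 
CommitBoundaryWriter and its naming scheme are invented here and are not 
part of storm-hdfs:

{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical writer that ties rotation to Trident commits instead of
// free-form rotation actions.  A finished file always holds whole batches
// and carries its txid range in its name, so a reader can dedup replayed
// batches, or discard a leaked partial file, without parsing records.
public class CommitBoundaryWriter {
    private final FileSystem fs;
    private final Path dir;
    private final long rotateBytes; // rotate once a file grows past this
    private Path openPath;
    private FSDataOutputStream out;
    private long firstTxid = -1;
    private long lastTxid = -1;

    public CommitBoundaryWriter(FileSystem fs, Path dir, long rotateBytes)
            throws IOException {
        this.fs = fs;
        this.dir = dir;
        this.rotateBytes = rotateBytes;
        openNewFile();
    }

    private void openNewFile() throws IOException {
        openPath = new Path(dir, "open-" + System.nanoTime());
        out = fs.create(openPath);
        firstTxid = -1;
        lastTxid = -1;
    }

    // Called once per tuple while a batch is being written.
    public void write(long txid, byte[] record) throws IOException {
        if (firstTxid < 0) {
            firstTxid = txid;
        }
        lastTxid = txid;
        out.write(record);
    }

    // Called from State.commit(txid): sync first, then rotate only here,
    // on a batch boundary, never in the middle of one.
    public void onCommit(long txid) throws IOException {
        out.hsync();
        if (out.getPos() >= rotateBytes) {
            out.close();
            // The final name records the txid range the file covers.
            fs.rename(openPath, new Path(dir,
                    "data-" + firstTxid + "-" + lastTxid));
            openNewFile();
        }
    }
}
{code}

Anything still named open-* after a crash can only be a partial file, so a 
cleanup job can quarantine it instead of leaking it, which also covers the 
rotationActions problem described above.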



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
