[ https://issues.apache.org/jira/browse/STORM-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562913#comment-14562913 ]

Robert Joseph Evans commented on STORM-837:
-------------------------------------------

HdfsState is not a MapState.  It is just a State; there are no get operations 
supported on it, it is a sink that writes all input to HDFS.  The issue is with 
others reading the data.  The readers in this case are likely to be a batch job 
using a Hadoop input format to read the data.  For regular Storm it provides at 
most once or at least once semantics, and the HdfsBolt, which is also a sink, 
provides the exact same semantics, so if there is duplicate data or lost data 
it could be for more reasons than just the bolt not syncing things correctly.  
In that case, however, I would like to see the spout not ack a tuple until it 
has been synced to disk; that way we truly can be sure no data is lost, but 
that is another issue.

For Trident we expect exactly-once semantics, especially from something that 
comes as an official part of Storm.  The file formats that the data is written 
out in are just a log, with no ability to overwrite out-of-date data the way a 
MapState can.  They also have no knowledge of ZooKeeper or of which batch ids 
have or have not been fully committed.  And even if they did have that 
knowledge, the entries would have to carry the commit ID in them to let the 
reader know which ones it should ignore while reading.
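
For reference, the Trident State interface is just the two commit hooks 
(sketched here as it stood in storm-core at the time; the package was 
storm.trident.state, though names may differ across versions):

    public interface State {
        // Called before the tuples of a batch are applied to the state.
        void beginCommit(Long txid);
        // Called once the batch has been fully applied.
        void commit(Long txid);
    }

HdfsState implements this interface but ignores both calls, which is the 
heart of this issue.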

> HdfsState ignores commits
> -------------------------
>
>                 Key: STORM-837
>                 URL: https://issues.apache.org/jira/browse/STORM-837
>             Project: Apache Storm
>          Issue Type: Bug
>            Reporter: Robert Joseph Evans
>            Priority: Critical
>
> HdfsState works with Trident, which is supposed to provide exactly once 
> processing.  It does this in two ways: first by informing the state about 
> commits so it can be sure the data is written out, and second by having a 
> commit id, so that double commits can be handled.
> HdfsState ignores the beginCommit and commit calls, and with that ignores the 
> ids.  This means that if you use HdfsState and your worker crashes you may 
> both lose data and get some data twice.
> At a minimum the flush and file rotation should be tied to the commit in some 
> way.  The commit ID should at a minimum be written out with the data so 
> someone reading the data can have a hope of deduping it themselves.
> Also with the rotationActions it is possible for a file that was partially 
> written to be leaked, and never moved to the final location, because it is not 
> rotated.  I personally think the actions are too generic for this case and 
> need to be deprecated.
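
A minimal sketch of what the requested behavior could look like.  This is 
hypothetical code, not the storm-hdfs implementation: CommitAwareHdfsState, 
the RecordWriter type, and its append/hsync methods are assumptions standing 
in for HdfsState's internal stream handling.

    import storm.trident.state.State;  // package name as of this era of Storm

    // Hypothetical commit-aware sink: tag each record with the txid from
    // beginCommit so readers can dedupe replays, and sync on commit so an
    // acked batch is guaranteed to be on disk.
    public class CommitAwareHdfsState implements State {
        private final RecordWriter writer;  // assumed HDFS stream wrapper
        private Long currentTxid;

        public CommitAwareHdfsState(RecordWriter writer) {
            this.writer = writer;
        }

        @Override
        public void beginCommit(Long txid) {
            currentTxid = txid;  // remember the batch id for tagging
        }

        public void write(String record) {
            // Prefix every record with the txid; a reader that sees a txid
            // repeated across files knows the later copy is a replayed batch.
            writer.append(currentTxid + "\t" + record);
        }

        @Override
        public void commit(Long txid) {
            writer.hsync();  // flush to HDFS before the batch is acked
        }
    }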


