[
https://issues.apache.org/jira/browse/STORM-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Joseph Evans updated STORM-837:
--------------------------------------
Comment: was deleted
(was: HdfsState is not a MapState. It is just a State; there are no get
operations supported on it, it is a sink that writes all input to HDFS. The
issue is with others reading the data. The readers in this case are likely to
be a batch job using a Hadoop input format to read the data. Regular Storm
provides at-most-once or at-least-once semantics, and the HdfsBolt, which is
also a sink, provides those exact same semantics, so if there is duplicate data
or lost data it could be for more reasons than just the bolt not syncing things
correctly. In that case, however, I would like to see a tuple not be acked
until it has been synced to disk, so we can truly be sure no data is lost; but
that is another issue.
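As a minimal sketch of that sync-before-ack idea (this is not the actual
HdfsBolt code; the stream setup and field layout are illustrative):

{code:java}
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;
import org.apache.hadoop.fs.FSDataOutputStream;

import java.io.IOException;
import java.util.Map;

public class SyncBeforeAckBolt extends BaseRichBolt {
    private OutputCollector collector;
    private transient FSDataOutputStream out; // opened against HDFS in prepare()

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // open 'out' on the target HDFS file here (omitted for brevity)
    }

    @Override
    public void execute(Tuple tuple) {
        try {
            out.writeBytes(tuple.getString(0) + "\n");
            out.hsync();            // durable on HDFS before we ack
            collector.ack(tuple);   // a crash before the sync causes a replay, not data loss
        } catch (IOException e) {
            collector.fail(tuple);  // let the tuple be replayed
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // pure sink, no output streams
    }
}
{code}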
For Trident we expect exactly-once semantics, especially from something that
comes as an official part of Storm. The file formats the data is written out
in are just logs, with no ability to overwrite out-of-date data the way a
MapState can. They also have no knowledge of ZooKeeper or of which batch ids
have or have not been fully committed. And even if they did have that
knowledge, the entries carry no commit ID to let a reader know which ones it
should ignore while reading.)
> HdfsState ignores commits
> -------------------------
>
> Key: STORM-837
> URL: https://issues.apache.org/jira/browse/STORM-837
> Project: Apache Storm
> Issue Type: Bug
> Reporter: Robert Joseph Evans
> Priority: Critical
>
> HdfsState works with Trident, which is supposed to provide exactly-once
> processing. It does this in two ways: first by informing the state about
> commits so it can be sure the data is written out, and second by supplying a
> commit id, so that double commits can be handled.
> HdfsState ignores the beginCommit and commit calls, and with them ignores the
> ids. This means that if you use HdfsState and your worker crashes, you may
> both lose data and get some data twice.
> At a minimum the flush and file rotation should be tied to the commit in some
> way. The commit ID should also be written out with the data so someone
> reading the data has a hope of deduping it themselves.
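> As a rough sketch of what that could look like (this is not the actual
> HdfsState code, and the RecordWriter abstraction below is hypothetical),
> a commit-aware sink State might be:
>
> {code:java}
> import storm.trident.state.State;
>
> public class CommitAwareHdfsState implements State {
>     private final RecordWriter writer; // hypothetical wrapper over an HDFS stream
>     private Long currentTxId;          // txid of the in-flight batch
>
>     public CommitAwareHdfsState(RecordWriter writer) {
>         this.writer = writer;
>     }
>
>     @Override
>     public void beginCommit(Long txid) {
>         // Remember the txid so every record in this batch carries it,
>         // letting a reader dedupe replayed batches.
>         this.currentTxId = txid;
>     }
>
>     public void write(String record) {
>         // Prefix each record with the commit id it belongs to.
>         writer.append(currentTxId + "\t" + record);
>     }
>
>     @Override
>     public void commit(Long txid) {
>         // Sync to HDFS before the batch is acknowledged, so a worker
>         // crash cannot lose data Trident believes is committed.
>         writer.hsync();
>     }
> }
>
> interface RecordWriter {
>     void append(String line);
>     void hsync();
> }
> {code}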
> Also, with the rotationActions it is possible for a partially written file
> to be leaked and never moved to its final location, because it is never
> rotated. I personally think the actions are too generic for this use case and
> need to be deprecated.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)