[
https://issues.apache.org/jira/browse/STORM-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635679#comment-14635679
]
ASF GitHub Bot commented on STORM-837:
--------------------------------------
GitHub user arunmahadevan opened a pull request:
https://github.com/apache/storm/pull/644
[STORM-837] Support for exactly once semantics in HdfsState
Changes to support exactly once semantics in HdfsState.
1. Moved the file rotation and sync to commit()
2. In pre-commit, if the txid has been seen before, recover the data up to
that point by copying it to a new file, then discard the current data file.
3. In pre-commit, atomically update [current txid, data file path, current
offset] in a per-partition index file.
4. To keep it simple, automatically turn off exactly once semantics if
TimedRotation policy is in use.
Tested the normal flow and a simulated recovery scenario, with both regular
HDFS files and sequence files. Everything appears to work.
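The pre-commit steps (2 and 3) in the list above can be sketched roughly as
follows. This is a minimal illustration only, not the code from the pull
request: the class TxnIndexedWriter, its method names, and the
"txid,file,offset" index layout are all hypothetical, and the local file
system (java.nio) stands in for HDFS. Rotation and sync in commit() are left
out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical sketch; names and index layout are assumptions, not HdfsState.
class TxnIndexedWriter {
    private final Path dir;
    private final Path indexFile;
    private Path dataFile;

    TxnIndexedWriter(Path dir) throws IOException {
        this.dir = dir;
        this.indexFile = dir.resolve("index");
        // After a restart, resume with the data file recorded in the index.
        this.dataFile = Files.exists(indexFile)
                ? dir.resolve(readIndex()[1])
                : Files.createTempFile(dir, "data-", ".txt");
    }

    // Called before any record of batch `txid` is written.
    void beginCommit(long txid) throws IOException {
        Path stale = null;
        if (Files.exists(indexFile)) {
            String[] idx = readIndex();
            if (Long.parseLong(idx[0]) == txid) {
                // Replayed batch: keep only the bytes written before this
                // txid started, copied into a fresh data file.
                stale = dir.resolve(idx[1]);
                long offset = Long.parseLong(idx[2]);
                byte[] all = Files.readAllBytes(stale);
                int keep = (int) Math.min(offset, all.length);
                Path fresh = Files.createTempFile(dir, "data-", ".txt");
                Files.write(fresh, Arrays.copyOf(all, keep));
                dataFile = fresh;
            }
        }
        // Record [txid, data file, offset at the start of this batch].
        writeIndex(txid, dataFile.getFileName().toString(), Files.size(dataFile));
        // Discard the old file only after the index points at the new one.
        if (stale != null) {
            Files.delete(stale);
        }
    }

    void write(String record) throws IOException {
        Files.write(dataFile, (record + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.APPEND);
    }

    String contents() throws IOException {
        return new String(Files.readAllBytes(dataFile), StandardCharsets.UTF_8);
    }

    private String[] readIndex() throws IOException {
        byte[] raw = Files.readAllBytes(indexFile);
        return new String(raw, StandardCharsets.UTF_8).trim().split(",");
    }

    // Write to a temp file, then rename over the index. Assumes the platform
    // replaces an existing target on ATOMIC_MOVE (true on POSIX systems).
    private void writeIndex(long txid, String file, long offset) throws IOException {
        Path tmp = dir.resolve("index.tmp");
        String line = txid + "," + file + "," + offset;
        Files.write(tmp, line.getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, indexFile, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

In this sketch, a worker that crashes midway through batch 2 and then
replays it calls beginCommit(2) again; the partial output of the first
attempt is dropped, so the batch ends up in the data file once.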
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/arunmahadevan/storm master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/storm/pull/644.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #644
----
commit 5631b9d4746f34127a1bc89cb4488d2b2d8ec9d7
Author: Arun Mahadevan <[email protected]>
Date: 2015-07-21T19:04:09Z
Support for exactly once semantics in HdfsState
----
> HdfsState ignores commits
> -------------------------
>
> Key: STORM-837
> URL: https://issues.apache.org/jira/browse/STORM-837
> Project: Apache Storm
> Issue Type: Bug
> Reporter: Robert Joseph Evans
> Assignee: Arun Mahadevan
> Priority: Critical
>
> HdfsState works with Trident, which is supposed to provide exactly-once
> processing. Trident does this in two ways: first, by informing the state
> about commits so it can be sure the data is written out, and second, by
> supplying a commit ID so that double commits can be handled.
> HdfsState ignores the beginCommit and commit calls, and with them the
> commit IDs. This means that if you use HdfsState and your worker crashes,
> you may both lose data and get some data twice.
> At a minimum, the flush and file rotation should be tied to the commit in
> some way. The commit ID should also be written out with the data, so that
> someone reading the data has a chance of deduplicating it themselves.
> Also, with the rotationActions it is possible for a partially written file
> to be leaked and never moved to its final location, because it is never
> rotated. I personally think the actions are too generic for this case and
> need to be deprecated.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)