[ https://issues.apache.org/jira/browse/STORM-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14658863#comment-14658863 ]
ASF GitHub Bot commented on STORM-837:
--------------------------------------
Github user revans2 commented on the pull request:
https://github.com/apache/storm/pull/644#issuecomment-128140442
I did a quick pass through the code and it looks OK, but I have not looked
at it in great detail. I am not very happy with the limitations on which
rotation policy you can use, nor with the size limit. I would rather be correct
but slow by default in all cases, even if users set bad configs (> 1GB), and
give them the power to make it fast but wrong if they know the risks and can
deal with them. Also, a hard-coded 1GB limit seems a little strange. What if we
have a 10GigE connection or even InfiniBand for HDFS and all of the data
happens to be in memory? We could in theory have processed the 1GB in less
than a second, which is still painful but not the end of the world.
Why don't we want to support a time-based rotation that rotates at the end
of a batch once the time has passed, instead of in the middle of the batch?
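To make the question concrete, here is a rough sketch of what I mean by a batch-aligned time-based rotation. The class and method names are hypothetical (nothing here is an existing storm-hdfs API), and it assumes the state calls the check only from its commit path, after the batch has been flushed:

```java
/**
 * Hypothetical sketch, not part of storm-hdfs: a time-based rotation check
 * that is only ever consulted at a batch boundary.
 */
class BatchAlignedTimedRotation {
    private final long intervalMillis;
    private long lastRotationMillis;

    BatchAlignedTimedRotation(long intervalMillis, long nowMillis) {
        this.intervalMillis = intervalMillis;
        this.lastRotationMillis = nowMillis;
    }

    /**
     * Assumed to be called once per batch, from the commit path, after the
     * batch's data has been flushed. It is never called mid-batch, so a file
     * can only be rotated between batches.
     */
    boolean shouldRotateAtCommit(long nowMillis) {
        if (nowMillis - lastRotationMillis >= intervalMillis) {
            lastRotationMillis = nowMillis;
            return true;
        }
        return false;
    }
}
```

With this shape the interval becomes a lower bound rather than an exact period: a file may stay open slightly longer than configured, but it never contains a partial batch.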
> HdfsState ignores commits
> -------------------------
>
> Key: STORM-837
> URL: https://issues.apache.org/jira/browse/STORM-837
> Project: Apache Storm
> Issue Type: Bug
> Reporter: Robert Joseph Evans
> Assignee: Arun Mahadevan
> Priority: Critical
>
> HdfsState works with Trident, which is supposed to provide exactly-once
> processing. It does this in two ways: first, by informing the state about
> commits so it can be sure the data is written out, and second, by supplying a
> commit ID, so that double commits can be handled.
> HdfsState ignores the beginCommit and commit calls, and with that ignores the
> IDs. This means that if you use HdfsState and your worker crashes, you may
> both lose data and get some data twice.
> At a minimum, the flush and file rotation should be tied to the commit in some
> way. The commit ID should also be written out with the data so that
> someone reading the data has a hope of deduplicating it themselves.
> Also, with the rotationActions it is possible for a partially written file
> to be leaked and never moved to the final location, because it is never
> rotated. I personally think the actions are too generic for this case and
> need to be deprecated.