[ https://issues.apache.org/jira/browse/STORM-2786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215158#comment-16215158 ]
Jungtaek Lim commented on STORM-2786:
-------------------------------------
Great finding!
There are generally two kinds of failed tuples: explicit failures and timeouts.
Explicitly failed tuples are removed from the map via
https://github.com/apache/storm/blob/124acb92dff04a57b530ab4d95a698abc8ff46d9/storm-client/src/jvm/org/apache/storm/daemon/Acker.java#L120-L122
but timed-out tuples would still be leaked because of what you observed.
The spout handles timed-out tuples independently, so the only symptom is the
memory leak, which makes the problem hard for users to notice.
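To make the difference between the two cases concrete, here is a minimal
sketch of a tick-driven rotating map. It is a simplified stand-in written for
this comment, not the code in Storm's RotatingMap or Acker: the class
RotatingMapSketch, the helper pendingAfter, and the bucket count of 2 are all
assumptions for illustration. It only shows that an explicit remove frees an
entry right away, while a timed-out entry is freed only if something calls
rotate(), which is what a tick would normally trigger.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class RotatingMapSketch {

    /** Simplified bucketed map: entries expire only when rotate() is called. */
    static final class RotatingMap<K, V> {
        private final Deque<Map<K, V>> buckets = new ArrayDeque<>();

        RotatingMap(int numBuckets) {
            for (int i = 0; i < numBuckets; i++) {
                buckets.addFirst(new HashMap<>());
            }
        }

        void put(K key, V value) {
            // Keep the key only in the newest bucket.
            for (Map<K, V> bucket : buckets) {
                bucket.remove(key);
            }
            buckets.peekFirst().put(key, value);
        }

        V remove(K key) {
            for (Map<K, V> bucket : buckets) {
                if (bucket.containsKey(key)) {
                    return bucket.remove(key);
                }
            }
            return null;
        }

        /** Drops the oldest bucket; intended to be called once per tick. */
        Map<K, V> rotate() {
            Map<K, V> expired = buckets.removeLast();
            buckets.addFirst(new HashMap<>());
            return expired;
        }

        int size() {
            int total = 0;
            for (Map<K, V> bucket : buckets) {
                total += bucket.size();
            }
            return total;
        }
    }

    /** Returns how many entries are still tracked after the given number of ticks. */
    static int pendingAfter(int ticks) {
        RotatingMap<Long, String> pending = new RotatingMap<>(2); // hypothetical bucket count
        pending.put(1L, "explicitly failed");
        pending.put(2L, "timed out at the spout");

        pending.remove(1L); // explicit fail: entry is removed immediately

        for (int i = 0; i < ticks; i++) {
            pending.rotate(); // only happens if ticks are delivered
        }
        return pending.size();
    }

    public static void main(String[] args) {
        System.out.println("no ticks:  " + pendingAfter(0));
        System.out.println("two ticks: " + pendingAfter(2));
    }
}
```

With no ticks the timed-out entry stays in the map indefinitely; with as many
ticks as there are buckets it is dropped.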
> Ackers leak tracking info on failure and lots of other cases.
> -------------------------------------------------------------
>
> Key: STORM-2786
> URL: https://issues.apache.org/jira/browse/STORM-2786
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-client, storm-core
> Affects Versions: 0.9.1-incubating, 0.10.0, 1.0.0, 2.0.0
> Reporter: Robert Joseph Evans
> Assignee: Robert Joseph Evans
> Priority: Critical
>
> Over the weekend we had an incident where ackers were running out of memory
> at a really scary rate. It turns out that they were having a lot of
> failures, for an unrelated reason, but each of the failures was resulting in
> tuple tracking being lost because...
> We don't send ticks to any system components ever...
> https://github.com/apache/storm/blob/124acb92dff04a57b530ab4d95a698abc8ff46d9/storm-client/src/jvm/org/apache/storm/executor/Executor.java#L384
> and ackers are system components.
> So the tracking map was never rotated and all failed tuples
> https://github.com/apache/storm/blob/124acb92dff04a57b530ab4d95a698abc8ff46d9/storm-client/src/jvm/org/apache/storm/daemon/Acker.java#L97-L103
> were never deleted from the map.
> This leak eventually made the ackers crash, and when they came back up the
> other components kept blasting them with messages that would never be fully
> acked which also leaked because of the tick problem.
> Looking back this has been in every release since 0.9.1-incubating. It
> appears to have been introduced by
> https://github.com/apache/storm/commit/483ce454a3b2cd31b5d1c34e9365346459b358a8
> So every apache release has this problem (which is the only reason I have not
> marked this as a blocker, because apparently it is not so bad that anyone has
> noticed in the past 4 years).