Robert Joseph Evans created STORM-2786:
------------------------------------------
Summary: Ackers leak tracking info on failure and lots of other cases.
Key: STORM-2786
URL: https://issues.apache.org/jira/browse/STORM-2786
Project: Apache Storm
Issue Type: Bug
Components: storm-client, storm-core
Affects Versions: 0.9.1-incubating, 0.10.0, 1.0.0, 2.0.0
Reporter: Robert Joseph Evans
Assignee: Robert Joseph Evans
Priority: Critical
Over the weekend we had an incident where ackers were running out of memory at
a really scary rate. It turns out that they were having a lot of failures, for
an unrelated reason, but each of the failures was resulting in tuple tracking
info being leaked because...
We don't send ticks to any system components ever...
https://github.com/apache/storm/blob/124acb92dff04a57b530ab4d95a698abc8ff46d9/storm-client/src/jvm/org/apache/storm/executor/Executor.java#L384
and ackers are system components.
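As a rough illustration of why the ackers never see a tick, here is a minimal sketch of the filtering described above. It is not the code from Executor.java: the class name TickGateSketch and the method componentsReceivingTicks are hypothetical, and the "__" prefix check is an assumption about how system component ids (such as "__acker") are recognized.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of the behavior described above.
// It is NOT the actual Executor implementation.
final class TickGateSketch {

    // Assumption: system component ids start with "__" (for example "__acker").
    static boolean isSystemId(String componentId) {
        return componentId.startsWith("__");
    }

    // Returns the components that would be sent tick tuples when
    // system components are skipped unconditionally.
    static List<String> componentsReceivingTicks(List<String> componentIds) {
        List<String> ticked = new ArrayList<>();
        for (String id : componentIds) {
            if (!isSystemId(id)) {
                ticked.add(id);
            }
        }
        return ticked;
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("spout", "bolt", "__acker");
        // The acker is filtered out, so it is never told to rotate its state.
        System.out.println(componentsReceivingTicks(all));
    }
}
```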
So the tracking map was never rotated, and all failed tuples
https://github.com/apache/storm/blob/124acb92dff04a57b530ab4d95a698abc8ff46d9/storm-client/src/jvm/org/apache/storm/daemon/Acker.java#L97-L103
were never deleted from the map.
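To make the leak concrete, here is a small sketch of a bucketed map whose stale entries are only dropped when it is rotated. It assumes, as described above, that rotation is driven by ticks and that entries for failed tuples stay in the map until they expire. RotatingMapSketch and AckerLeakDemo are hypothetical names for illustration, not Storm classes.

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;

// Hypothetical, simplified stand-in for a bucketed rotating map.
final class RotatingMapSketch<K, V> {
    private final LinkedList<Map<K, V>> buckets = new LinkedList<>();

    RotatingMapSketch(int numBuckets) {
        for (int i = 0; i < numBuckets; i++) {
            buckets.add(new HashMap<>());
        }
    }

    // Stores the entry in the newest bucket, removing any older copy.
    void put(K key, V value) {
        for (Map<K, V> bucket : buckets) {
            bucket.remove(key);
        }
        buckets.getFirst().put(key, value);
    }

    // Drops the oldest bucket and opens a fresh one; returns what expired.
    Map<K, V> rotate() {
        Map<K, V> expired = buckets.removeLast();
        buckets.addFirst(new HashMap<>());
        return expired;
    }

    int size() {
        int total = 0;
        for (Map<K, V> bucket : buckets) {
            total += bucket.size();
        }
        return total;
    }
}

final class AckerLeakDemo {
    // Tracks `failedTuples` entries, then applies `ticks` rotations
    // and reports how many entries are still held in memory.
    static int pendingAfter(int failedTuples, int ticks) {
        RotatingMapSketch<Long, String> pending = new RotatingMapSketch<>(2);
        for (long id = 0; id < failedTuples; id++) {
            pending.put(id, "failed");
        }
        for (int i = 0; i < ticks; i++) {
            pending.rotate();
        }
        return pending.size();
    }

    public static void main(String[] args) {
        System.out.println("no ticks:  " + pendingAfter(1000, 0));
        System.out.println("two ticks: " + pendingAfter(1000, 2));
    }
}
```

With no ticks the map keeps every entry it was given, which is the unbounded growth seen in the incident; once rotations happen, the same entries age out.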
This leak eventually made the ackers crash, and when they came back up the
other components kept blasting them with messages that would never be fully
acked, which also leaked because of the tick problem.
Looking back, this has been in every release since 0.9.1-incubating. It appears
to have been introduced by
https://github.com/apache/storm/commit/483ce454a3b2cd31b5d1c34e9365346459b358a8
So every Apache release has this problem (which is the only reason I have not
marked this as a blocker, because apparently it is not so bad that anyone has
noticed in the past 4 years).