Robert Joseph Evans created STORM-2733:
------------------------------------------
Summary: Make Load Aware Shuffle much better at really bad situations
Key: STORM-2733
URL: https://issues.apache.org/jira/browse/STORM-2733
Project: Apache Storm
Issue Type: Bug
Components: storm-client
Affects Versions: 1.0.0, 2.0.0
Reporter: Robert Joseph Evans
Assignee: Robert Joseph Evans

We recently had an issue where some bolts got really backed up and started to
die from OOMs. The cause turned out to be twofold.

First, GC slowed the worker down so much that it could not keep up even with
less than 1% of the traffic, which was still being sent to it. That made it
almost impossible for the worker to recover.

Second, serializing the tuples took a lot longer than processing them, so the
send queue filled up much more quickly than the receive queue.

To help fix this I plan to make 2 changes. First, we need a better algorithm
that can actually shut off the flow entirely to a very slow bolt. Second, we
need to take the send queue into account when shuffling.
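
To make the two ideas concrete, here is a minimal sketch of one way a chooser
could behave. It is not the existing LoadAwareShuffleGrouping code and not
necessarily what the patch will do. The class name LoadAwareChooser, the 0.9
cutoff, the use of max() to combine the two queue loads, and the fallback to a
plain shuffle when every task is over the cutoff are all assumptions made for
illustration. Loads are assumed to be fractions in the range 0.0 to 1.0.

```java
import java.util.Random;

// Hypothetical sketch only; names and thresholds are illustrative assumptions.
final class LoadAwareChooser {
    // Assumed cutoff: a task at or above this load gets no traffic at all.
    static final double CUTOFF = 0.9;

    private final Random random;

    LoadAwareChooser(long seed) {
        this.random = new Random(seed);
    }

    // Treat a task as being as loaded as the worse of its two queues, so a
    // full send queue counts even when the receive queue looks healthy.
    static double effectiveLoad(double receiveQueueLoad, double sendQueueLoad) {
        return Math.max(receiveQueueLoad, sendQueueLoad);
    }

    // Weight drops to exactly zero past the cutoff, which shuts the flow off
    // entirely instead of merely reducing it.
    static double weight(double load) {
        return load >= CUTOFF ? 0.0 : 1.0 - load;
    }

    // Returns the index of the chosen target task.
    int choose(double[] receiveLoads, double[] sendLoads) {
        int n = receiveLoads.length;
        double[] weights = new double[n];
        double total = 0.0;
        for (int i = 0; i < n; i++) {
            weights[i] = weight(effectiveLoad(receiveLoads[i], sendLoads[i]));
            total += weights[i];
        }
        if (total == 0.0) {
            // Assumption: if every task is past the cutoff, fall back to a
            // plain uniform shuffle, since the tuple has to go somewhere.
            return random.nextInt(n);
        }
        // Weighted random pick; zero-weight tasks can never be selected.
        double r = random.nextDouble() * total;
        int lastPositive = -1;
        for (int i = 0; i < n; i++) {
            if (weights[i] > 0.0) {
                lastPositive = i;
                r -= weights[i];
                if (r < 0.0) {
                    return i;
                }
            }
        }
        // Guard against floating-point rounding at the upper edge.
        return lastPositive;
    }
}
```

With receive loads {0.10, 0.20, 0.05} and send loads {0.10, 0.10, 0.99}, the
third task is never picked even though its receive queue is nearly empty,
because its send queue is past the cutoff.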

This is not the full set of changes needed by STORM-2686, but it is a step in
that direction. I am going to try to set it up so that the two algorithms work
nicely together.