Robert Joseph Evans created STORM-2733:
------------------------------------------

             Summary: Make Load Aware Shuffle much better at really bad situations
                 Key: STORM-2733
                 URL: https://issues.apache.org/jira/browse/STORM-2733
             Project: Apache Storm
          Issue Type: Bug
          Components: storm-client
    Affects Versions: 1.0.0, 2.0.0
            Reporter: Robert Joseph Evans
            Assignee: Robert Joseph Evans


We recently had an issue where some bolts got badly backed up and started to 
die from OOMs.  The cause turned out to be twofold.

First, GC slowed the worker down so much that it could not keep up even with 
the < 1% of the traffic that was still being sent to it, which made it almost 
impossible to recover.

Second, serialization of the tuples took a lot longer than processing them, 
so the send queue filled up much more quickly than the receive queue.

To fix this I plan to do two things.  First, we need a better algorithm that 
can actually shut off the flow entirely to a very slow bolt; second, we need 
to take the send queue into account when shuffling.
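
As a rough illustration of those two ideas (this is not the Storm 
implementation, and every name below is hypothetical), here is a minimal 
Python sketch. It assumes each downstream task reports its receive-queue and 
send-queue fill levels as fractions, treats the fuller of the two as the 
task's load, and gives a task zero weight once that load crosses an assumed 
cutoff, so no new tuples are routed to it.

```python
import random
from dataclasses import dataclass
from typing import Dict


@dataclass
class TaskLoad:
    """Hypothetical per-task load report; both values are fractions in [0, 1]."""
    receive_queue: float
    send_queue: float


def combined_load(load: TaskLoad) -> float:
    # Assumption: the fuller of the two queues is what limits the task,
    # so a full send queue counts even when the receive queue is empty.
    return max(load.receive_queue, load.send_queue)


def shuffle_weights(loads: Dict[int, TaskLoad], cutoff: float = 0.9) -> Dict[int, float]:
    """Return a routing weight per task id; 0.0 means "send nothing"."""
    weights = {}
    for task_id, load in loads.items():
        level = combined_load(load)
        # At or above the (assumed) cutoff the flow is shut off entirely.
        weights[task_id] = 0.0 if level >= cutoff else 1.0 - level
    if all(w == 0.0 for w in weights.values()):
        # Every task is overloaded: fall back to a plain, even shuffle
        # so that tuples still have somewhere to go.
        return {task_id: 1.0 for task_id in loads}
    return weights


def choose_task(loads: Dict[int, TaskLoad], rng: random.Random, cutoff: float = 0.9) -> int:
    """Pick a target task at random, in proportion to its weight."""
    weights = shuffle_weights(loads, cutoff)
    task_ids = list(weights)
    return rng.choices(task_ids, weights=[weights[t] for t in task_ids], k=1)[0]
```

In this sketch a task whose receive queue is nearly empty but whose send 
queue is 95% full still gets weight 0.0, which is the case the second point 
above is about. The 0.9 cutoff and the even-shuffle fallback are arbitrary 
choices made for the example.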

This is not the full set of changes needed by STORM-2686, but it is a step in 
that direction.  I am going to try to set it up so that the two algorithms 
work nicely together.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
