Github user revans2 commented on the issue: https://github.com/apache/storm/pull/2241

Issues that need to be addressed before `max.spout.pending` can be removed (sorry about the wall of text):

1. **Worker-to-worker backpressure.** The netty client and server have no backpressure in them. If for some reason the network cannot keep up, the netty client will keep buffering messages internally inside netty until you hit GC/OOM issues like those described in the design doc. We ran into this when using the ZK-based backpressure instead of `max.spout.pending` on a very high-throughput topology that hit a network glitch.

2. **Single choke point for messages leaving a worker.** Currently there is a single disruptor queue and a single thread responsible for routing all messages from within the worker to external workers. If any one of the clients sending messages to other workers blocked (backpressure), it would stop all other messages from leaving the worker. In practice this negates the "a choke up in a single bolt will not put the brakes on all the topology components" goal from your design, and as the number of workers in a topology grows, the impact when this does happen grows too.

3. **Load-aware grouping needs to be back on by default, and probably improved some.** Again, one of the stated goals of your backpressure design is "a choke up in a single bolt will not put the brakes on all the topology components". Most of the existing groupings effectively negate this, because each upstream component will, on average, send a message to every downstream component relatively frequently. Unless we can route around backed-up components, one totally blocked bolt will stop the whole topology from functioning. If throughput drops by 20% when load-aware grouping is on, then we probably want to do some profiling and understand why.

4. **Timer tick tuples.** A number of things rely on timer ticks: system-critical things, like ackers and spouts timing out tuples and metrics (at least for now), as well as code within bolts and spouts that wants to do something periodically without having a separate thread. Right now there is a single thread per worker that does these ticks. In the past it was a real pain to debug failures when that thread blocked trying to insert a message into a full queue: metrics stopped flowing, spouts stopped timing tuples out, and other really odd things happened that I cannot remember off the top of my head. I don't want to have to go back to that.

5. **Cycles (non-DAG topologies).** I know we strongly discourage users from building these types of things, and quite honestly I am happy to deprecate support for them, but we currently do support them. So if we are going to stop supporting them, let's get on with deprecating them in 1.x with clear warnings etc. Otherwise, when this change goes in, there will be angry customers with deadlocked topologies.