[
https://issues.apache.org/jira/browse/STORM-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035890#comment-15035890
]
Robert Joseph Evans commented on STORM-1351:
--------------------------------------------
[~roshan_naik],
The bolts don't have any pause built in. Spouts read from the disruptor queue
in a non-blocking way so that they can handle the ack/fail messages that are
sent to them. They are essentially polling from two different sources, the
user-supplied spout code and the disruptor queue, so the wait strategy is how
they avoid spinning in a busy loop. Bolts read from only a single source, the
disruptor queue. In the common case a bolt is blocking, so if a client, like
HBase, is getting errors and the bolt decides it needs to slow down its
processing, it can just sleep. This will cause the input queue to back up,
and if you have load-aware shuffle enabled it will start routing tuples
around the slow bolt, giving them to bolts that are not getting errors. If
all of the bolts are getting errors, automatic back-pressure, if it is
enabled, will stop the spouts from sending any more messages until the
blockage clears up.
There are potentially lots of things we can do to improve on this system, as
the current back-pressure is a fairly big hammer; it would be nice to stop
only the spouts that contribute to the blockage, or simply throttle them so
they run at a better pace.
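As a minimal sketch of the sleep-on-error pattern described above (the class
name, the writeToExternalStore() call, and the backoff constants are all
hypothetical stand-ins for whatever client you use):
{code:java}
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

// When the external client errors, block inside execute() so the input queue
// backs up and load-aware shuffle / back-pressure can react.
public class SlowDownOnErrorBolt extends BaseRichBolt {
    private OutputCollector collector;
    private long backoffMs = 0;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            writeToExternalStore(input); // hypothetical client call
            backoffMs = 0;               // client is healthy again
            collector.ack(input);
        } catch (Exception e) {
            // Client is unhealthy: sleep before failing the tuple. Blocking
            // here is exactly what makes the input queue fill up.
            backoffMs = Math.min(backoffMs == 0 ? 100 : backoffMs * 2, 10000L);
            try {
                Thread.sleep(backoffMs);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
            collector.fail(input);
        }
    }

    private void writeToExternalStore(Tuple t) throws Exception {
        // stand-in for an HBase put or similar
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // no output stream in this sketch
    }
}
{code}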
If you have an asynchronous bolt you can still implement similar throttling
yourself in your bolt code by limiting the maximum number of tuples that can
be in flight at any point in time, as in the sketch below. None of this
really needs to be a part of the system. If you would like to work on a
solution that makes it more transparent to bolts and provides system-level
metrics about throttling, that seems OK.
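Here is a sketch of that in-flight cap around an asynchronous client, using a
plain java.util.concurrent.Semaphore (the Callback interface and
asyncClientWrite() are hypothetical stand-ins for your client's async API):
{code:java}
import java.util.Map;
import java.util.concurrent.Semaphore;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

// Caps how many tuples are handed to an async client at any point in time.
// execute() blocks on the semaphore once the cap is reached, which backs up
// the input queue exactly like a synchronous slow bolt would.
public class BoundedAsyncBolt extends BaseRichBolt {
    private static final int MAX_IN_FLIGHT = 100; // tune for your client
    private transient Semaphore inFlight;
    private transient OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.inFlight = new Semaphore(MAX_IN_FLIGHT);
    }

    @Override
    public void execute(final Tuple input) {
        inFlight.acquireUninterruptibly(); // throttle: wait for a free slot
        asyncClientWrite(input, new Callback() {
            @Override
            public void onSuccess() {
                inFlight.release();
                // Depending on your Storm version you may need to
                // synchronize access to the collector from callback threads.
                collector.ack(input);
            }

            @Override
            public void onFailure(Throwable t) {
                inFlight.release();
                collector.fail(input);
            }
        });
    }

    // Hypothetical async client API, stand-in for e.g. an async HBase call.
    interface Callback { void onSuccess(); void onFailure(Throwable t); }

    private void asyncClientWrite(Tuple t, Callback cb) {
        cb.onSuccess(); // a real client would complete this asynchronously
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
    }
}
{code}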
> Storm spouts and bolts need a way to communicate problems back to topology
> runner
> --------------------------------------------------------------------------------
>
> Key: STORM-1351
> URL: https://issues.apache.org/jira/browse/STORM-1351
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-core
> Reporter: Roshan Naik
> Assignee: Roshan Naik
>
> A spout can have trouble generating a tuple in nextTuple() because
> - a) there is currently no data to generate, or
> - b) it is experiencing some I/O issue
> If the spout returns immediately from the nextTuple() call, then nextTuple()
> will be invoked again immediately, leading to a CPU spike. The CPU spike
> would last until the situation is remedied by new data arriving or the I/O
> issues getting resolved.
> Currently, to work around this, spouts have to implement an exponential
> backoff mechanism internally. There are two problems with this:
> - each spout needs to implement this backoff logic itself
> - since nextTuple() has an internal sleep and takes longer to return, the
> latency metrics computation gets thrown off
> *Thoughts for Solution:*
> The spout should be able to indicate a 'no data', 'experiencing error' or
> 'all good' status back to the caller of nextTuple() so that the right backoff
> logic can kick in.
> - The most natural way to do this is via the return type of the nextTuple()
> method. Currently nextTuple() returns void. However, changing it would break
> source and binary compatibility, since the new Storm would not be able to
> invoke the method on unmodified spouts. This breaking change could only be
> considered as an option prior to v1.0.
> - Alternatively, this can be done by providing an additional method on the
> collector to indicate the condition to the topology runner. The spout can
> invoke this explicitly. The metrics can then also account for 'no data' and
> 'error' attempts.
> - Alternatively, the topology runner may just examine the collector to see
> whether the nextTuple() call generated any new data. In this case it cannot
> distinguish between errors and no incoming data.
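To make the second alternative concrete, here is a purely hypothetical sketch
of what a status hint on the spout's collector could look like; neither
Status nor reportStatus() exist in Storm, and source.poll()/Record are
illustrative only:
{code:java}
// Imagined status values a spout could report:
public enum Status { ALL_GOOD, NO_DATA, EXPERIENCING_ERROR }

// Imagined extra method on the spout's output collector:
//   void reportStatus(Status status);

// nextTuple() could then return immediately instead of sleeping internally:
public void nextTuple() {
    Record record = source.poll();              // hypothetical source read
    if (record == null) {
        collector.reportStatus(Status.NO_DATA); // runner applies the backoff
        return;                                 // no internal sleep needed
    }
    collector.emit(new Values(record.payload()));
}
{code}
The topology runner could then apply its configured wait strategy on 'no
data' and 'error' reports while keeping the nextTuple() latency metrics
clean.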