[
https://issues.apache.org/jira/browse/STORM-154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rick Kellogg updated STORM-154:
-------------------------------
Component/s: storm-core
> Provide more information to spout "fail" method
> -----------------------------------------------
>
> Key: STORM-154
> URL: https://issues.apache.org/jira/browse/STORM-154
> Project: Apache Storm
> Issue Type: New Feature
> Components: storm-core
> Reporter: James Xu
>
> https://github.com/nathanmarz/storm/issues/39
> It might be helpful to distinguish between unexpected errors (when they can
> be caught) and timeouts.
> ----------
> conflagrator: +1 on this. I wrote a class extending OutputCollector with the
> following wrapper functions:
> public class VerboseOutputCollector extends OutputCollector {
>     public void fail(Tuple tuple) {}
>     public void fail(Tuple tuple, String message) {}
>     public void fail(Tuple tuple, Exception e) {}
>     public void fail(Tuple tuple, Exception e, String message) {}
> }
> Each function generates output containing the class and the line number of
> the "fail" call, plus the message or Exception if provided. It's very handy
> for log analytics.
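
A minimal sketch of how a wrapper like this might recover the class and line number of the "fail" call, assuming the standard JVM stack-trace API. FailCallSite and describe are hypothetical names, not conflagrator's actual code and not part of Storm; the skip parameter is an assumption for stepping past wrapper frames.

```java
// Hypothetical helper, not part of Storm: formats the call site of a "fail" call.
final class FailCallSite {
    private FailCallSite() {}

    /**
     * Describes the code location that called this method.
     *
     * @param skip    extra frames to step over (0 = direct caller of describe)
     * @param message optional free-form message, may be null
     * @param cause   optional exception, may be null
     */
    static String describe(int skip, String message, Throwable cause) {
        // Frame 0 is describe() itself; frame 1 is its direct caller.
        StackTraceElement[] stack = new Throwable().getStackTrace();
        int index = Math.min(Math.max(skip, 0) + 1, stack.length - 1);
        StackTraceElement site = stack[index];

        StringBuilder out = new StringBuilder("fail called at ")
                .append(site.getClassName())
                .append(':')
                .append(site.getLineNumber());
        if (message != null) {
            out.append(" message=").append(message);
        }
        if (cause != null) {
            out.append(" cause=").append(cause);
        }
        return out.toString();
    }
}
```

Called from inside a wrapper's fail(tuple, message) with skip = 1, this would report the line that invoked the wrapper rather than the wrapper itself.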
> ----------
> dmoore247: +1
> With 0.8.1 on a local cluster I've spent many hours tracking down failures,
> going through executor.clj code, turning on full logging, adding TaskHooks,
> playing with time out parameters, adding exception handling etc.
> As an aside, the SpoutFail....latencyMs value was always null in my tests
> on the LocalCluster.
> Still, all I know is that the message failed, but not why (Timeout?).
> Based on playing with the timeout parameters, I deduce that the failures were
> caused by timeouts.
> Where in Storm does it determine, hey, we've exceeded a timeout, let's fail
> this Tuple? At least then we/I could add a debug message to Storm.
> Many thanks.
> ----------
> ruleb: +1
> Had the same situation: I searched a whole day to conclude that a trident
> topology regularly dropped complete batches of tuples because the timeout
> was reached while they were queued up at a busy bolt.
> Having a small "tuple timeout reached" message in the logs at info level
> would save many developer days.
> Many thanks.
> ----------
> thecoop: This would be very helpful to determine why tuples are failing,
> rather than just an arbitrary number in the UI - just putting something in
> the logs as an info or warn saying a tuple failed and some information on why
> it failed.
> ----------
> brianantonelli: +1
> Would be great to get more information about what caused the spout to fail.
> I'm also seeing that the latency is always null.
> ----------
> revans2: It is fairly simple to extend the spout to indicate whether a tuple
> failed because of a timeout or because of something else, but it is much
> harder to determine what that something else was. The fail API on all
> output collectors does not have anything that could be used to map it to a
> reason. We would have to extend the API and decide what the failure reason
> should look like. Perhaps a free-form string, but that is really horrible if
> you want to aggregate the failures in metrics. Also we would want to limit
> the size of the string so as not to overwhelm the acker bolts.
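
One way to picture the trade-off revans2 describes is a small fixed set of failure kinds for aggregation, plus a length-capped detail string. The sketch below is purely illustrative: FailureReason, Kind, MAX_DETAIL_CHARS and the value 64 are all assumptions, not an existing or proposed Storm API.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical value type, not part of Storm.
final class FailureReason {
    // A closed set of kinds aggregates cleanly in metrics, unlike free-form text.
    enum Kind { TIMEOUT, EXPLICIT_FAIL, OTHER }

    // Arbitrary cap (assumption) so the detail cannot grow without bound.
    static final int MAX_DETAIL_CHARS = 64;

    final Kind kind;
    final String detail;

    FailureReason(Kind kind, String detail) {
        this.kind = kind;
        this.detail = truncate(detail);
    }

    // Returns "" for null, otherwise at most MAX_DETAIL_CHARS characters.
    static String truncate(String text) {
        if (text == null) {
            return "";
        }
        return text.length() <= MAX_DETAIL_CHARS
                ? text
                : text.substring(0, MAX_DETAIL_CHARS);
    }

    // Counts failures per kind, ignoring the detail text entirely.
    static Map<Kind, Integer> countByKind(List<FailureReason> reasons) {
        Map<Kind, Integer> counts = new EnumMap<>(Kind.class);
        for (FailureReason reason : reasons) {
            counts.merge(reason.kind, 1, Integer::sum);
        }
        return counts;
    }
}
```

With a shape like this, a metrics consumer could group on kind alone, while the truncated detail stays available for logs.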