James Xu created STORM-135:
------------------------------
Summary: Tracer bullets
Key: STORM-135
URL: https://issues.apache.org/jira/browse/STORM-135
Project: Apache Storm (Incubating)
Issue Type: New Feature
Reporter: James Xu
https://github.com/nathanmarz/storm/issues/146
Debugging the flow of tuples through a Storm topology can be pretty tedious.
One might have to do lots of logging and watch many log files, or do other
kinds of instrumentation. It would be great to include a system to select
certain tuples for tracing, and track the progress of those tuples through the
topology.
Here is a use case:
Suppose I were doing stats aggregation with Storm. Some things I might want
to verify are:
Are the aggregation and the flush happening in a timely way?
Are there hotspots?
Are there unexpected latencies? Are some bolts taking a long time?
To answer the above questions, I might select a random sample of tuples, or
maybe a random sample of a specific subset of tuples. The tuples to be traced
could be tagged with a special attribute.
I would want to track the following events:
Spout emit: record (task id, spout name, timestamp)
For each bolt:
When a traced tuple arrives and execute() is called: (task id, bolt name,
timestamp)
When a tuple is emitted that is anchored on the tuple that arrived: (task id,
bolt name, timestamp)
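As a concrete illustration, each event might be captured as a small value
object along these lines (a sketch only; the TraceEvent name and fields are
illustrative, not an existing Storm API):
{code:java}
// Illustrative only: a possible value class for the per-tuple trace events
// described above. TraceEvent and EventType are hypothetical names.
public class TraceEvent {
    public enum EventType { SPOUT_EMIT, BOLT_EXECUTE, BOLT_EMIT }

    public final EventType type;
    public final int taskId;           // task that observed the event
    public final String componentName; // spout or bolt name
    public final long timestampMs;     // wall-clock time of the event

    public TraceEvent(EventType type, int taskId, String componentName, long timestampMs) {
        this.type = type;
        this.taskId = taskId;
        this.componentName = componentName;
        this.timestampMs = timestampMs;
    }

    @Override
    public String toString() {
        return type + " task=" + taskId + " component=" + componentName + " ts=" + timestampMs;
    }
}
{code}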
Here is what I can do with the data from above (assuming one can correlate
tuples emitted with incoming tuples, based on the anchor):
For the aggregation bolt, I can look at the distribution of (emit timestamp -
incoming timestamp) and see if it makes sense.
I can graph the life of one tuple, look at a spout/bolt vs. timestamp graph, and
visually see how much time is being spent in each bolt, as well as how much
time is spent in the Storm infrastructure / ZMQ.
This data can be overlaid for multiple tuples to get a sense of the timing
distribution for the topology.
Using the task ID information, one can do a cool overlay graph that traces the
distribution of a number of tuples over a topology. One can use that to see
whether field groupings are working as intended, whether load is unevenly
distributed, and so on.
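As a rough sketch of the first analysis above, given the (execute timestamp,
emit timestamp) pairs recovered from the trace for one bolt, the latency
distribution could be summarized like this (the data shape is an assumption):
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch only: summarize the latency distribution for one bolt, given
// {executeTs, emitTs} pairs correlated via the anchoring information.
public class LatencySummary {
    public static void summarize(List<long[]> pairs) {
        if (pairs.isEmpty()) return;
        List<Long> latencies = new ArrayList<Long>();
        for (long[] p : pairs) {
            latencies.add(p[1] - p[0]); // time the tuple spent inside the bolt
        }
        Collections.sort(latencies);
        int n = latencies.size();
        System.out.println("p50=" + latencies.get(n / 2) + "ms"
                + " p99=" + latencies.get((int) (n * 0.99)) + "ms"
                + " max=" + latencies.get(n - 1) + "ms");
    }
}
{code}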
For now I may start implementing this idea in the scala-storm DSL.
----------
tdunning: I actually think that, if possible, unanchored tuples should also be
traced.
One simple implementation would be to add some information to each tuple to
indicate its tracing status.
As each tuple arrives, the tracing status would be inspected. If set, a tracing
wrapper for the collector would be used in place of the actual collector for
that tuple. This would make tracing of all resulting tuples possible, not just
the anchored tuples.
It would also be very useful to have a traceMessage() method on the collector
that could be used by the bolt to record a trace message if tracing is on.
It would also make sense to have a method that turns tracing on or off for a
collector. This might need to return a new tracing collector in order to allow
collectors to be immutable.
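A minimal sketch of such a wrapper, assuming Storm's 0.7-era backtype.storm
classes; TracingCollector, withTracing(), and traceMessage() are the names
proposed here, not existing API:
{code:java}
import java.util.List;
import backtype.storm.task.OutputCollector;
import backtype.storm.tuple.Tuple;

// Sketch of the proposed wrapper collector. Only OutputCollector and Tuple
// are real Storm classes; everything else is hypothetical.
public class TracingCollector {
    private final OutputCollector delegate;
    private final boolean tracing;

    public TracingCollector(OutputCollector delegate, boolean tracing) {
        this.delegate = delegate;
        this.tracing = tracing;
    }

    // Turning tracing on or off returns a new collector, keeping
    // instances immutable as suggested above.
    public TracingCollector withTracing(boolean on) {
        return new TracingCollector(delegate, on);
    }

    public List<Integer> emit(Tuple anchor, List<Object> values) {
        if (tracing) {
            traceMessage("emit anchored on " + anchor.getMessageId() + ": " + values);
        }
        return delegate.emit(anchor, values);
    }

    // Bolts could call this directly to drop ad-hoc messages into the trace.
    public void traceMessage(String message) {
        if (tracing) {
            System.err.println("[trace] " + message); // stand-in for a real trace sink
        }
    }
}
{code}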
The tracing information that I see would be useful includes:
a) possibly a trace level similar to the logging level used by log4j and other
logging packages
b) a trace id so that multiple traces can be simultaneously active. This could
be generated when tracing is turned on. It would be nice to have a provision
for supplying an external id that could be correlated with outside entities
like a user-id.
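A sketch of what that per-tuple trace metadata could look like (all names
here are hypothetical):
{code:java}
import java.util.UUID;

// Sketch of the suggested trace metadata: a log4j-style severity level plus
// a trace id that is either generated or supplied externally.
public class TraceInfo {
    public enum Level { OFF, INFO, DEBUG, TRACE } // log4j-style levels

    public final Level level;
    public final String traceId;

    private TraceInfo(Level level, String traceId) {
        this.level = level;
        this.traceId = traceId;
    }

    // Generated id: created when tracing is turned on.
    public static TraceInfo start(Level level) {
        return new TraceInfo(level, UUID.randomUUID().toString());
    }

    // External id: lets the trace be correlated with an outside entity,
    // e.g. a user-id.
    public static TraceInfo start(Level level, String externalId) {
        return new TraceInfo(level, externalId);
    }
}
{code}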
----------
velvia: +1 for adding a tracing level to the tuple metadata.
Nathan or others:
I think this ticket should be split up into a couple parts:
1) A generic callback or hook mechanism for when tuples are emitted and when
they arrive via execute() in bolts.
2) A specific callback for filtering and implementing tracer bullets
3) Additional metadata in the Tuple class to track tracing, and changes to
allow it to be serialized
Should this be split up into multiple issues?
Also, pointers to where in the code these three pieces could be implemented
would be awesome.
Thanks!
Evan
----------
tdunning: With JIRA, sub-tasks would be a great idea. With Github's very basic
issue tracker, probably not so much.
----------
nathanmarz: FYI, I've added hooks into Storm for 0.7.1
----------
tdunning: Can you provide a pointer or three to where the hooks are?
----------
nathanmarz: I explained it here: #153 (comment)
I'll have a wiki page about hooks once the feature is released.
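For reference, a hook that logs emits might look roughly like the sketch
below. This follows the hooks API as later documented under
backtype.storm.hooks; the exact callbacks available in 0.7.1 may differ, so
treat this as a shape, not a reference:
{code:java}
import java.util.Map;
import backtype.storm.hooks.BaseTaskHook;
import backtype.storm.hooks.info.EmitInfo;
import backtype.storm.task.TopologyContext;

// Sketch of a task hook that logs every emit from the task it is attached to.
public class TracingTaskHook extends BaseTaskHook {
    private int taskId;
    private String componentId;

    @Override
    public void prepare(Map conf, TopologyContext context) {
        this.taskId = context.getThisTaskId();
        this.componentId = context.getThisComponentId();
    }

    @Override
    public void emit(EmitInfo info) {
        // info.values and info.stream are public fields on EmitInfo
        System.err.println("[trace] task=" + taskId + " component=" + componentId
                + " stream=" + info.stream + " values=" + info.values
                + " ts=" + System.currentTimeMillis());
    }
}
// Register from a bolt/spout's prepare(): context.addTaskHook(new TracingTaskHook());
{code}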
----------
mrflip: @thedatachef has implemented this. We'd like guidance on the
implementation choices made; you'll see the pull request shortly.
We targeted Trident, not Storm. It's our primary use case, and we want to see
values at each operation boundary (not each bolt); meanwhile, hooks seem to
give good-enough support for Storm.
Trident tuples have methods to set, unset, and test whether the tuple is
traceable. They are labeled as traceable by an assembly, which you can drop in
anywhere in the topology. We have one such assembly that makes every nth tuple
traceable.
All descendants of a traceable tuple are traceable. The framework doesn't ever
unlabel things, even if a tuple is prolific -- it's easy enough to thin the
herd with an assembly.
When the collector emits a tuple, if the tuple is traceable it:
- anoints the new tuple as traceable
- records the current step in the trace history -- a tracer bullet carries the
history of every stage it has passed through
- writes an illustration of the trace history to the progress log. Since only
a fraction of tuples are expected to be traceable, we feel efficiency matters
less than that the output be structured, verbose, and readable.
We don't do anything to preserve traceability across an aggregation, mostly
because we don't know what to uniformly do in that case.
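For illustration, the every-nth assembly could be as simple as the filter
below. BaseFilter and TridentTuple are stock Trident classes; setTraceable()
is the method added by this patch, not part of stock Trident:
{code:java}
import storm.trident.operation.BaseFilter;
import storm.trident.tuple.TridentTuple;

// Sketch of the "mark every nth tuple traceable" assembly described above.
public class TraceEveryNth extends BaseFilter {
    private final int n;
    private transient long seen;

    public TraceEveryNth(int n) {
        this.n = n;
    }

    @Override
    public boolean isKeep(TridentTuple tuple) {
        if (++seen % n == 0) {
            tuple.setTraceable(true); // hypothetical: added by the tracer-bullet patch
        }
        return true; // never drops tuples; only labels them
    }
}
// Usage (sketch): stream.each(stream.getOutputFields(), new TraceEveryNth(1000));
{code}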