James Xu created STORM-135:
------------------------------

             Summary: Tracer bullets
                 Key: STORM-135
                 URL: https://issues.apache.org/jira/browse/STORM-135
             Project: Apache Storm (Incubating)
          Issue Type: New Feature
            Reporter: James Xu


https://github.com/nathanmarz/storm/issues/146

Debugging the flow of tuples through a Storm topology can be pretty tedious. 
One might have to do lots of logging and watch many log files, or do other 
kinds of instrumentation. It would be great to include a system to select 
certain tuples for tracing, and track the progress of those tuples through the 
topology.

Here is a use case:

Suppose one were to do stats aggregation using Storm. Some things I might want 
to ensure are:
- Is the aggregation and flush happening in a timely way?
- Are there hotspots?
- Are there unexpected latencies? Are some bolts taking a long time?
To answer the above questions, I might select a random sample of tuples, or 
maybe a random sample of a specific subset of tuples. The tuples to be traced 
could be tagged with a special attribute.
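
As a sketch of the tagging idea (the field name and sampling scheme are assumptions, not an existing Storm API), a spout could flag a random sample by emitting an extra boolean field that downstream bolts check:

    // Hedged sketch: tag roughly 1% of emitted tuples for tracing by
    // appending an extra boolean field. "traced" is illustrative only.
    import java.util.Random;

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.tuple.Values;

    public class SamplingEmitter {
        private static final double SAMPLE_RATE = 0.01;
        private final Random random = new Random();

        public void emitWithTraceFlag(SpoutOutputCollector collector, Object event) {
            boolean traced = random.nextDouble() < SAMPLE_RATE;
            // Downstream bolts inspect the last field to decide whether
            // to record trace events for this tuple.
            collector.emit(new Values(event, traced));
        }
    }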

I would want to track the following events (one possible record shape is sketched after this list):

- Spout emit: send (task id, spout name, timestamp)
- For each bolt:
  - When a traced tuple arrives and execute() is called: (task id, bolt name, timestamp)
  - When a tuple is emitted that is anchored on the tuple that arrived: (task id, bolt name, timestamp)
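
For concreteness, those three events could share one record shape. This is a sketch; the field names are assumptions, not anything in Storm:

    // Hedged sketch of a trace event covering the three cases listed above.
    public class TraceEvent {
        public enum Kind { SPOUT_EMIT, BOLT_EXECUTE, BOLT_EMIT }

        public final Kind kind;
        public final int taskId;           // task id of the spout/bolt instance
        public final String componentName; // spout or bolt name
        public final long timestampMs;     // when the event happened
        public final Object anchorId;      // id of the incoming/anchor tuple,
                                           // to correlate emits with arrivals

        public TraceEvent(Kind kind, int taskId, String componentName,
                          long timestampMs, Object anchorId) {
            this.kind = kind;
            this.taskId = taskId;
            this.componentName = componentName;
            this.timestampMs = timestampMs;
            this.anchorId = anchorId;
        }
    }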
Here is what I can do with the data from above (assuming one can correlate 
tuples emitted with incoming tuples, based on the anchor):

- For the aggregation bolt, I can look at the distribution of (emit timestamp - incoming timestamp) and see if it makes sense.
- I can graph the life of one tuple, look at a spout/bolt vs. timestamp graph, and visually see how much time is being spent in each bolt, as well as how much time is spent in the Storm infrastructure / ZMQ.
- This data can be overlaid for multiple tuples to get a sense of the timing distribution for the topology.
- Using the task ID information, one can do a cool overlay graph that traces the distribution of a number of tuples over a topology. One can use that to see whether field groupings are working, are unevenly distributed, etc.
For now I may start implementing this idea in the scala-storm DSL.
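
Building on the TraceEvent sketch above, the first analysis could be computed like this (again a sketch, assuming events can be correlated through the anchor id):

    // Hedged sketch: per-bolt latency = emit timestamp minus the matching
    // execute timestamp, correlated through the anchor id.
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class LatencyAnalysis {
        public static List<Long> boltLatencies(List<TraceEvent> events, String bolt) {
            Map<Object, Long> executeTimes = new HashMap<Object, Long>();
            List<Long> latencies = new ArrayList<Long>();
            for (TraceEvent e : events) {
                if (!bolt.equals(e.componentName)) continue;
                if (e.kind == TraceEvent.Kind.BOLT_EXECUTE) {
                    executeTimes.put(e.anchorId, e.timestampMs);
                } else if (e.kind == TraceEvent.Kind.BOLT_EMIT) {
                    Long arrived = executeTimes.get(e.anchorId);
                    if (arrived != null) latencies.add(e.timestampMs - arrived);
                }
            }
            return latencies;
        }
    }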

----------
tdunning: I actually think that, if possible, unanchored tuples should also be 
traced.

One simple implementation would be to add some information to each tuple to 
indicate the tracing status of the tuple.

As each tuple arrives, the tracing status would be inspected. If set, a tracing 
wrapper for the collector would be used in place of the actual collector for 
that tuple. This would make tracing of all resulting tuples possible, not just 
the anchored tuples.
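
A minimal sketch of that wrapper idea, assuming the trace flag rides along as an extra tuple field (TracingOutputCollector is illustrative, not an existing Storm class):

    // Hedged sketch: wrap the real collector so that while a traced tuple
    // is being processed, every emitted tuple is marked traced and logged,
    // anchored or not.
    import java.util.List;

    import backtype.storm.task.OutputCollector;
    import backtype.storm.tuple.Tuple;

    public class TracingOutputCollector {
        private final OutputCollector delegate;

        public TracingOutputCollector(OutputCollector delegate) {
            this.delegate = delegate;
        }

        public List<Integer> emit(Tuple anchor, List<Object> values) {
            values.add(Boolean.TRUE); // illustrative: append the trace flag
            System.out.println("TRACE emit: " + values);
            return delegate.emit(anchor, values);
        }
    }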

It would also be very useful to have a traceMessage() method on the collector 
that could be used by the bolt to record a trace message if tracing is on.

It would also make sense to have a method that turns tracing on or off for a 
collector. This might need to return a new tracing collector in order to allow 
collectors to be immutable.
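
The two suggestions might look like this as an interface (illustrative signatures only; nothing here exists in Storm):

    // Hedged sketch of the proposed collector additions.
    public interface TraceableCollector {
        // Record a free-form message in the trace, if tracing is on.
        void traceMessage(String message);

        // Return a collector with tracing toggled, leaving this one
        // unchanged so collectors can stay immutable.
        TraceableCollector withTracing(boolean enabled);
    }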

The tracing information that I see as useful includes:

a) possibly a trace level similar to the logging level used by log4j and other 
logging packages

b) a trace id so that multiple traces can be simultaneously active. This could 
be generated when tracing is turned on. It would be nice to have a provision to 
provide an external id that could be correlated to outside entities like a 
user-id.
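
Put together, the per-tuple trace metadata could be as small as this (names are illustrative):

    // Hedged sketch: a trace level (as with log4j levels) plus a trace id
    // that is generated when tracing is turned on, or supplied externally
    // so it can be correlated with outside entities like a user-id.
    import java.util.UUID;

    public class TraceInfo {
        public final int level;      // e.g. 0 = off, higher = more verbose
        public final String traceId; // generated or externally supplied

        public TraceInfo(int level, String externalId) {
            this.level = level;
            this.traceId = externalId != null ? externalId : UUID.randomUUID().toString();
        }
    }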

----------
velvia: +1 for adding tracing level to the tuple metadata.

Nathan or others:

I think this ticket should be split up into a few parts:

1) A generic callback or hook mechanism for when tuples are emitted and when they arrive via execute() in bolts.
2) A specific callback for filtering and implementing tracer bullets.
3) Additional metadata in the Tuple class to track tracing, and changes to allow it to be serialized.

Should this be split up into multiple issues?

Also, pointers to where in the code the three could be implemented would be awesome.

Thanks!
Evan

----------
tdunning: With JIRA, sub-tasks would be a great idea. With Github's very basic 
issue tracker, probably not so much.

----------
nathanmarz: FYI, I've added hooks into Storm for 0.7.1
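
For reference, a tracing hook built on that mechanism might look roughly like this (a sketch against the ITaskHook/BaseTaskHook API; class and field names may differ between Storm versions):

    // Hedged sketch: a task hook that logs emits and bolt executes.
    // Register it per component via TopologyContext.addTaskHook in
    // prepare()/open(), or topology-wide via "topology.auto.task.hooks".
    import backtype.storm.hooks.BaseTaskHook;
    import backtype.storm.hooks.info.BoltExecuteInfo;
    import backtype.storm.hooks.info.EmitInfo;

    public class TracingTaskHook extends BaseTaskHook {
        @Override
        public void emit(EmitInfo info) {
            System.out.println("TRACE emit task=" + info.taskId
                    + " stream=" + info.stream + " values=" + info.values);
        }

        @Override
        public void boltExecute(BoltExecuteInfo info) {
            System.out.println("TRACE execute task=" + info.executingTaskId
                    + " tuple=" + info.tuple);
        }
    }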

----------
tdunning: Can you provide a pointer or three to where the hooks are?

----------
nathanmarz: I explained it here: #153 (comment)

I'll have a wiki page about hooks once the feature is released.

----------
mrflip: @thedatachef has implemented this. We'd like guidance on the 
implementation choices made; you'll see the pull request shortly.

We targeted Trident, not Storm. It's our primary use case, and we want to see values at each operation boundary (not each bolt); meanwhile, hooks seem to give good-enough support for Storm.
- Trident tuples have methods to set, unset and test whether the tuple is traceable.
- Tuples become labeled as traceable by an assembly, which you can put anywhere in the topology. We have one such assembly that makes every nth tuple traceable (sketched below).
- All descendants of a traceable tuple are traceable. The framework never unlabels things, even if a tuple is prolific -- it's easy enough to thin the herd with an assembly.
- When the collector emits a tuple, if the tuple is traceable it:
  - anoints the new tuple as traceable;
  - records the current step in the trace history -- a tracer bullet carries the history of every stage it has passed through;
  - writes an illustration of the trace history to the progress log. Since only a fraction of tuples are expected to be traceable, we feel efficiency matters less than that the output be structured, verbose and readable.
- We don't do anything to preserve traceability across an aggregation, mostly because we don't know what to do uniformly in that case.
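
A sketch of the every-nth labeling step (the filter API is standard Trident; setTraceable is the method added by the proposed patch, not stock Trident):

    // Hedged sketch: a Trident filter that marks every nth tuple traceable
    // without filtering anything out. Attach it anywhere with something
    // like stream.each(fields, new TraceEveryNth(100)).
    import storm.trident.operation.BaseFilter;
    import storm.trident.tuple.TridentTuple;

    public class TraceEveryNth extends BaseFilter {
        private final int n;
        private long count = 0;

        public TraceEveryNth(int n) { this.n = n; }

        @Override
        public boolean isKeep(TridentTuple tuple) {
            if (++count % n == 0) {
                tuple.setTraceable(true); // hypothetical, from the patch
            }
            return true; // keep every tuple; this filter only labels
        }
    }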



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
