Hi All,

This started out as a quick Slack post, then became a reasonably sized
email, and now it has headings!

*Introduction*

I am working on a performance modeling system for Heron. Hopefully this
system will be useful for checking that proposed plans will meet
performance targets, and also for checking whether currently running
physical plans will have back pressure issues at higher traffic rates.

To do this I need to know what proportion of tuples are routed from each
upstream instance to its downstream instances, which is a metric that Heron
does not provide by default.
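
To make the quantity concrete, here is a minimal sketch of the calculation I
have in mind. It assumes that, for one upstream instance, we already have the
number of tuples each downstream task received from it. The class name,
method name and task IDs are made up for illustration and are not part of
Heron.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Heron.
class RoutingProportions {

    // Input: downstream task ID -> tuples received from ONE upstream instance.
    // Output: downstream task ID -> fraction of that instance's tuples.
    static Map<Integer, Double> proportions(Map<Integer, Long> receivedByDownstreamTask) {
        long total = 0;
        for (long count : receivedByDownstreamTask.values()) {
            total += count;
        }
        Map<Integer, Double> result = new HashMap<>();
        for (Map.Entry<Integer, Long> entry : receivedByDownstreamTask.entrySet()) {
            // Avoid dividing by zero when nothing has been received yet.
            double fraction = total == 0 ? 0.0 : (double) entry.getValue() / total;
            result.put(entry.getKey(), fraction);
        }
        return result;
    }
}
```

These per-pair proportions are what the model would need as input for each
upstream instance.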

*Proposal*

I have implemented a custom metric to do what I need in my test topologies.
It is a simple multi-count metric called "__receive-count", where the key
now includes the "sourceTaskId" value (which you can get from the tuple
instance) as well as the source component name and incoming stream name.

This is basically the same as the default "__execute-count" metric but the
metric name format is
"__receive-count/<source-component>/<source-task-ID>/<incoming-stream>"
instead of "__execute-count/<source-component>/<incoming-stream>"
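
As a rough sketch of the idea (not the code from my topologies), the
counting could look something like the following. The class and method names
are hypothetical stand-ins and a plain map is used in place of Heron's own
multi-count metric type; only the key layout is taken from the format above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a multi-count metric; not Heron's own metric class.
class ReceiveCountSketch {
    private final Map<String, Long> counts = new HashMap<>();

    // Builds "<source-component>/<source-task-ID>/<incoming-stream>",
    // i.e. the part of the metric name that follows "__receive-count/".
    static String key(String sourceComponent, int sourceTaskId, String streamId) {
        return sourceComponent + "/" + sourceTaskId + "/" + streamId;
    }

    // Called once for every tuple the bolt instance receives.
    void record(String sourceComponent, int sourceTaskId, String streamId) {
        counts.merge(key(sourceComponent, sourceTaskId, streamId), 1L, Long::sum);
    }

    // Returns 0 for a source that has not sent anything yet.
    long count(String sourceComponent, int sourceTaskId, String streamId) {
        return counts.getOrDefault(key(sourceComponent, sourceTaskId, streamId), 0L);
    }
}
```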

So I see two options:

   1. Create a new "__receive-count" metric and leave the "__execute-count"
   alone.
   2. Alter "__execute-count" to include the source task ID.

*Questions*

My first question is whether the metric name is parsed anywhere further
down the line, for example when aggregating component metrics in the
metrics manager. If so, would changing the name break things?

My second question is: if we do change "__execute-count", should we also
add the source task ID to other bolt metrics like "__execute-latency"? It
would be nice to see how latency changes by source instance. This is a
particular issue with two consecutive fields-grouped components, as
instances will receive very different key distributions, which could lead
to very different processing latencies.

*Implementation*

Adding this to the default metrics (or changing "__execute-count") seems
like it would be reasonably straightforward (famous last words). We would
need to modify the `FullBoltMetric` class to include the new metrics (if
required) and edit the `FullBoltMetric.executeTuple` method to accept the
"sourceTaskId" (which is already available in the
`BoltInstance.readTuplesAndExecute` method) as a 4th argument.
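
A minimal sketch of what option 1 might look like is below. This is a
stand-in class, not the real `FullBoltMetric`: I am assuming the three
existing arguments are the stream ID, the source component and the execute
latency, and plain maps are used in place of the real metric objects.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in; NOT the real FullBoltMetric class.
class BoltMetricSketch {
    final Map<String, Long> executeCount = new HashMap<>();
    final Map<String, Long> receiveCount = new HashMap<>();

    // Assumed existing arguments: streamId, sourceComponent, latencyNs.
    // sourceTaskId is the proposed 4th argument.
    void executeTuple(String streamId, String sourceComponent,
                      long latencyNs, int sourceTaskId) {
        // Existing behaviour: count per source component and stream.
        // (Latency handling is left out of this sketch.)
        executeCount.merge(sourceComponent + "/" + streamId, 1L, Long::sum);
        // Proposed addition: also count per source task ID.
        receiveCount.merge(
            sourceComponent + "/" + sourceTaskId + "/" + streamId, 1L, Long::sum);
    }
}
```

With option 2, the first `merge` call would instead take the longer key and
the second map would not be needed.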

Obviously, we will need to do the same with the Python implementation. Will
this also need to be changed in the Storm compatibility layer?

*Conclusion*

Having information on where tuples are flowing is really important if we
want to do more intelligent routing and adaptive auto-scaling in the
future. Hopefully this one small change/extra metric won't add any
significant processing overhead.

I look forward to hearing what you think.

Cheers,

Tom Cooper
W: www.tomcooper.org.uk  | Twitter: @tomncooper
<https://twitter.com/tomncooper>
