Hi All,

This started out as a quick Slack post, then became a reasonably sized email, and now it has headings!

*Introduction*

I am working on a performance modeling system for Heron. Hopefully this system will be useful for checking that proposed plans will meet performance targets, and for checking whether currently running physical plans will have back pressure issues at higher traffic rates. To do this I need to know what proportion of tuples are routed from each upstream instance to its downstream instances, which is a metric that Heron does not provide by default.

*Proposal*

I have implemented a custom metric that does what I need in my test topologies. It is a simple multi-count metric called "__receive-count", where the key includes the "sourceTaskId" value (which you can get from the tuple instance) as well as the source component name and incoming stream name. This is basically the same as the default "__execute-count" metric, but the metric name format is "__receive-count/<source-component>/<source-task-ID>/<incoming-stream>" instead of "__execute-count/<source-component>/<incoming-stream>".

So I see two options:

1. Create a new "__receive-count" metric and leave "__execute-count" alone.
2. Alter "__execute-count" to include the source task ID.

*Questions*

My first question is whether the metric name is parsed anywhere further down the line, for example when aggregating component metrics in the metrics manager. If so, would changing the name break things?

My second is: if we do change "__execute-count", should we also add the source task ID to other bolt metrics like "__execute-latency"? It would be nice to see how latency changes by source instance. This is a particular issue with two consecutive fields-grouped components, as instances will receive very different key distributions, which could lead to very different processing latencies.

*Implementation*

Adding this to the default metrics (or changing "__execute-count") seems like it would be reasonably straightforward (famous last words).
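
To make the key format concrete, here is a rough standalone sketch of the counting logic. It does not use Heron's metric classes: `ReceiveCountSketch`, `recordTuple`, `count` and `metricName` are names made up for illustration, and the component name "splitter" and task ID 3 are just example values.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a multi-count metric; not Heron's real class.
class ReceiveCountSketch {
    private final Map<String, Long> counts = new HashMap<>();

    // Builds the key "<source-component>/<source-task-ID>/<incoming-stream>".
    static String key(String sourceComponent, int sourceTaskId, String streamId) {
        return sourceComponent + "/" + sourceTaskId + "/" + streamId;
    }

    // Full metric name in the format proposed above.
    static String metricName(String sourceComponent, int sourceTaskId, String streamId) {
        return "__receive-count/" + key(sourceComponent, sourceTaskId, streamId);
    }

    // Would be called once per tuple received by the bolt instance.
    void recordTuple(String sourceComponent, int sourceTaskId, String streamId) {
        counts.merge(key(sourceComponent, sourceTaskId, streamId), 1L, Long::sum);
    }

    // Number of tuples seen so far from one upstream task on one stream.
    long count(String sourceComponent, int sourceTaskId, String streamId) {
        return counts.getOrDefault(key(sourceComponent, sourceTaskId, streamId), 0L);
    }

    public static void main(String[] args) {
        ReceiveCountSketch metric = new ReceiveCountSketch();
        metric.recordTuple("splitter", 3, "default");
        metric.recordTuple("splitter", 3, "default");
        System.out.println(metricName("splitter", 3, "default")
                + " = " + metric.count("splitter", 3, "default"));
    }
}
```

Summing these per-task counts across all downstream instances would give the routing proportions I am after.
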
We would need to modify the `FullBoltMetric` class to include the new metrics (if required) and edit the `FullBoltMetric.executeTuple` method to accept the "sourceTaskId" (which is already available in the "BoltInstance.readTuplesAndExecute" method) as a 4th argument. Obviously, we would need to do the same in the Python implementation. Will this also need to be changed in the Storm compatibility layer?

*Conclusion*

Having information on where tuples are flowing is really important if we want to be able to do more intelligent routing and adaptive auto-scaling in the future, and hopefully this one small change/extra metric won't add any significant processing overhead.

I look forward to hearing what you think.

Cheers,

Tom Cooper
W: www.tomcooper.org.uk | Twitter: @tomncooper <https://twitter.com/tomncooper>
