Hi all,
METRON-227 “Add Time-Based Flushing to Writer Bolt” and METRON-322 “Global Batching & flushing” have been dormant since July, but they contain some very valuable ideas. The basic observation is that Metron’s Writer queues generally flush on queue size, but not on time. As a result, messages on low-traffic or bursty channels can languish unprocessed, and therefore un-ack’ed, which results in Storm automatically replaying the messages after a certain timeout (topology.message.timeout.secs), or if too many total pending messages accumulate in a topology (topology.max.spout.pending). This causes duplicate messages and wasted computation, as well as unpredictable latency.

Storm now has a very nice, low-complexity solution for time-based flushing: Tick Tuples. I propose to use Tick Tuples to implement time-based flushing for all Writer queues that currently flush only on queue size. I will do this work in the context of METRON-322, subsuming METRON-227 into it.

Per the recommendation of some members of the Storm implementation team, I will default the queue flush timeout (topology.tick.tuple.freq.secs) in each Writer to half the value of topology.message.timeout.secs, minus a small delta. The default value of topology.message.timeout.secs is 30 seconds, so in many cases the queue flush timeout will be set to 14 seconds (30 / 2, minus a delta of 1); but this will be configurable.

The reporter of METRON-322 was also concerned about the “global” behavior of a topology, for instance the Enhancer topology with multiple telemetry-specific bolts in parallel. If each individual bolt accumulates a number of un-ack’ed messages, the total across the whole topology can become large, and if topology.max.spout.pending is set, that limit may be reached. However, the probability of this drops greatly if we implement a reasonable default for queue flush timeouts, and any remaining issue can be addressed by setting the bolt queue size limits, and the value of topology.max.spout.pending itself, appropriately.
Therefore, I will not at this time worry much about this “global” behavior, other than making sure that all Writers in the topology have queue flush timeouts.

Your thoughts, suggestions, and concerns are invited.

Thanks,
--Matt
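
To make the proposal above concrete, here is a minimal sketch of size-or-time flushing and of the proposed default interval. It is an illustration only, not Metron's writer code: the class TimedBatchQueue and its methods are hypothetical, messages are plain strings, and it deliberately has no Storm dependency so it can be run on its own. In a real bolt, the assumption is that onTick() would be called from execute() when the incoming tuple is a tick tuple (one arriving from Storm's "__system" component on the "__tick" stream), and that flushing would also write and ack the batched tuples.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical illustration of a writer queue that flushes on size OR on time.
 * Not Metron's implementation; names and structure are assumptions.
 */
class TimedBatchQueue {
    private final int batchSize;
    private final List<String> pending = new ArrayList<>();
    private final List<List<String>> flushed = new ArrayList<>();

    TimedBatchQueue(int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
    }

    /**
     * Proposed default: half of topology.message.timeout.secs minus a delta,
     * clamped so the result is never below 1 second.
     */
    static int defaultFlushIntervalSecs(int messageTimeoutSecs, int deltaSecs) {
        return Math.max(1, messageTimeoutSecs / 2 - deltaSecs);
    }

    /** Existing behavior: queue the message and flush once the batch is full. */
    void onMessage(String message) {
        pending.add(message);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** New behavior: on every tick, flush whatever is waiting, even a partial batch. */
    void onTick() {
        if (!pending.isEmpty()) {
            flush();
        }
    }

    /** Stand-in for "write the batch and ack its tuples": records a copy of the batch. */
    private void flush() {
        flushed.add(new ArrayList<>(pending));
        pending.clear();
    }

    int pendingCount() {
        return pending.size();
    }

    List<List<String>> flushedBatches() {
        return flushed;
    }
}
```

With a batch size of 3, two messages followed by a tick produce one flushed batch of two, rather than waiting indefinitely for a third message. Flushing on every tick is the simplest policy: a message waits at most one tick interval, which with the 14-second default stays well inside the 30-second message timeout.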