Hi all,

METRON-227 “Add Time-Based Flushing to Writer Bolt”, and 

METRON-322 “Global Batching & flushing”

have been dormant since July, but contain some very valuable ideas.  The basic
observation is that Metron’s Writer queues generally flush on queue size, but
not on time.  As a result, messages on low-traffic or bursty channels can sit
in a partially filled queue, unwritten and therefore un-ack’ed.  Storm then
fails those messages once topology.message.timeout.secs expires, and the spout
replays them.  If topology.max.spout.pending is set, the held messages also
count against each spout task's pending limit, which can stop the spout from
emitting new messages at all.  The result is duplicate messages, wasted
computation, and unpredictable latency.

Storm offers a simple, low-complexity mechanism for time-based flushing: Tick
Tuples, which Storm delivers to a bolt at a fixed interval configured through
topology.tick.tuple.freq.secs.

I propose to use Tick Tuples to implement time-based flushing for all Writer 
queues that currently flush only on queue size.
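
For illustration, here is a minimal, Storm-free sketch of a Writer queue that
flushes either when it fills up or when a tick arrives.  The class and method
names (TickFlushingQueue, onTick) are hypothetical and are not Metron's actual
Writer API.  In a real bolt, the tick would be detected in execute(), for
example with Storm's TupleUtils.isTick(tuple) helper, and the tuples in each
written batch would then be acked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Hypothetical Writer queue that flushes on size OR on a periodic tick.
 * Storm-free: onTick() stands in for receiving a tick tuple in execute().
 */
class TickFlushingQueue<T> {
    private final int batchSize;
    private final Consumer<List<T>> writer; // stands in for the real write + ack step
    private final List<T> pending = new ArrayList<>();

    TickFlushingQueue(int batchSize, Consumer<List<T>> writer) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.writer = writer;
    }

    /** Size-based flushing: the existing behavior. */
    void add(T message) {
        pending.add(message);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Time-based flushing: drain whatever has accumulated, however small the batch. */
    void onTick() {
        if (!pending.isEmpty()) {
            flush();
        }
    }

    int pendingCount() {
        return pending.size();
    }

    private void flush() {
        List<T> batch = new ArrayList<>(pending); // copy so the writer owns its batch
        pending.clear();
        writer.accept(batch);
    }

    public static void main(String[] args) {
        List<List<String>> written = new ArrayList<>();
        TickFlushingQueue<String> queue = new TickFlushingQueue<>(100, written::add);
        queue.add("a");
        queue.add("b"); // a quiet channel: far below the batch size of 100
        // prints "before tick: pending=2, batches=0"
        System.out.println("before tick: pending=" + queue.pendingCount() + ", batches=" + written.size());
        queue.onTick();
        // prints "after tick: pending=0, batches=1"
        System.out.println("after tick: pending=" + queue.pendingCount() + ", batches=" + written.size());
    }
}
```

The design point is that onTick() flushes whatever is pending, however small,
so no message waits much longer than one tick interval (plus any executor
queueing delay) before it is written and acked.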

I will do this work in the context of METRON-322, subsuming METRON-227 into it.

Per the recommendation of some members of the Storm implementation team, I will
default the queue flush timeout (topology.tick.tuple.freq.secs) in each Writer
to half the value of topology.message.timeout.secs, minus a small delta.  The
default value of topology.message.timeout.secs is 30 seconds, so in many cases
the queue flush timeout will default to 14 seconds (30 / 2 - 1); it will,
however, be configurable.
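
The default could be derived along these lines.  This is a sketch under
assumptions: the 1-second delta is inferred from the 30-to-14-second example,
the 1-second floor is added for illustration, and the class and method names
are hypothetical.  In an actual bolt, the resulting value would be returned
from getComponentConfiguration() under Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS.

```java
/**
 * Hypothetical helper deriving a default Writer flush interval (the value for
 * topology.tick.tuple.freq.secs) from topology.message.timeout.secs.
 */
final class FlushIntervalDefaults {
    static final int DEFAULT_MESSAGE_TIMEOUT_SECS = 30; // Storm's default topology.message.timeout.secs
    static final int DELTA_SECS = 1;                    // assumed margin, inferred from 30 -> 14

    private FlushIntervalDefaults() {}

    /** Half the message timeout minus DELTA_SECS, floored at 1 second (assumed floor). */
    static int defaultFlushIntervalSecs(int messageTimeoutSecs) {
        if (messageTimeoutSecs <= 0) {
            throw new IllegalArgumentException("messageTimeoutSecs must be positive");
        }
        return Math.max(1, messageTimeoutSecs / 2 - DELTA_SECS);
    }

    /** An explicitly configured interval, if present, overrides the derived default. */
    static int resolveFlushIntervalSecs(Integer configuredSecs, int messageTimeoutSecs) {
        if (configuredSecs != null) {
            if (configuredSecs <= 0) {
                throw new IllegalArgumentException("configured interval must be positive");
            }
            return configuredSecs;
        }
        return defaultFlushIntervalSecs(messageTimeoutSecs);
    }

    public static void main(String[] args) {
        System.out.println(defaultFlushIntervalSecs(DEFAULT_MESSAGE_TIMEOUT_SECS));   // prints 14
        System.out.println(resolveFlushIntervalSecs(5, DEFAULT_MESSAGE_TIMEOUT_SECS)); // prints 5
    }
}
```

One reading of the "half, minus delta" rule: a message enqueued just after a
tick waits up to one full interval before the next flush, so a 14-second
interval still leaves roughly half of the 30-second timeout for the write
itself and for the ack to propagate.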

The reporter of METRON-322 was also concerned about the “global” behavior of a
topology, for instance the Enhancer topology with multiple telemetry-specific
bolts in parallel.  If each individual bolt accumulates some un-ack’ed messages,
the total across the whole topology can become large, and if
topology.max.spout.pending is set, that limit may be reached, throttling the
spout.  However, the probability of this drops greatly once the queue flush
timeouts have a reasonable default, and any remaining issue can be addressed by
setting the bolt queue size limits, and topology.max.spout.pending itself,
appropriately.  Therefore, I will not address this “global” behavior further at
this time, beyond making sure that every Writer in the topology has a queue
flush timeout.
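
To put rough numbers on the “global” concern, the sketch below estimates how
many tuples could sit un-acked in Writer queues at once and compares that with
the spout-pending budget.  It rests on simplifying assumptions: each queued
tuple holds exactly one spout tuple open, a queue holds at most batchSize - 1
tuples before a size-based flush, and pending tuples are spread evenly across
spout tasks (topology.max.spout.pending is enforced per spout task).  The
class, bolt names, and numbers are hypothetical.

```java
import java.util.List;

/**
 * Hypothetical back-of-the-envelope check for the "global" pending concern.
 * Assumptions: each queued tuple holds one spout tuple open; a queue holds at most
 * batchSize - 1 tuples before a size-based flush; load is spread evenly over spout tasks.
 */
final class PendingBudget {
    /** One Writer bolt: its queue batch size and its number of executors. */
    static final class WriterBolt {
        final String name;
        final int batchSize;
        final int parallelism;

        WriterBolt(String name, int batchSize, int parallelism) {
            this.name = name;
            this.batchSize = batchSize;
            this.parallelism = parallelism;
        }
    }

    private PendingBudget() {}

    /** Worst-case number of tuples held across all Writer queues without a size-based flush. */
    static long worstCaseHeld(List<WriterBolt> bolts) {
        long total = 0;
        for (WriterBolt b : bolts) {
            total += (long) (b.batchSize - 1) * b.parallelism;
        }
        return total;
    }

    /** topology.max.spout.pending applies per spout task, so the total budget scales with task count. */
    static boolean canExhaustPendingBudget(List<WriterBolt> bolts, int maxSpoutPending, int spoutTasks) {
        return worstCaseHeld(bolts) >= (long) maxSpoutPending * spoutTasks;
    }

    public static void main(String[] args) {
        List<WriterBolt> bolts = List.of(
                new WriterBolt("sensorA", 100, 2), // hypothetical sensor writers
                new WriterBolt("sensorB", 50, 1));
        System.out.println(worstCaseHeld(bolts));                   // prints 247
        System.out.println(canExhaustPendingBudget(bolts, 200, 1)); // prints true
        System.out.println(canExhaustPendingBudget(bolts, 200, 2)); // prints false
    }
}
```

If the check comes out true, raising topology.max.spout.pending or lowering the
batch sizes restores headroom; with time-based flushing in place, the held
tuples also drain within about one tick interval regardless.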

Your thoughts, suggestions, and concerns are invited.

Thanks,

--Matt