Hi,

I’m re-posting this from Storm-User, as it didn’t get a reply there and touches 
the internal implementation quite a bit.

I am trying to understand the parallelism properties and transactional 
semantics offered by Trident and couldn’t find an answer to these two questions:

1. The "Trident Spouts" documentation [1] says that "[b]y default, Trident 
processes a single batch at a time, waiting for the batch to succeed or fail 
before trying another batch".
But do Trident bolts always wait until a batch is complete, collect the 
results, and then pass them on to the next bolt(s) as a complete batch? Without 
pipelining, this would mean that only one bolt can be active at a time, 
effectively preventing any parallelism between bolts.
Or are tuples delivered to the next bolt(s) as soon as they are emitted? 
Without pipelining, this would still introduce some idle time and increased 
latency, but it at least seems to make better use of resources.

2. The idea that states will only ever have to deal with a new batch or one 
from immediately before (assuming transactional or opaque-transactional spouts) 
is at the core of Trident’s state model.
On this topic, the docs section cited above promises that even with pipelining, 
"Trident will order any state updates taking place in the topology among 
batches". Is this a special guarantee for the built-in stateful operations 
(i.e. partitionPersist and persistentAggregate, which afaics uses 
partitionPersist internally), or can all bolts assume that they'll never see 
any batch repeated except the latest one they processed?
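
To make the assumption behind this question concrete, here is a minimal sketch of the transactional-state idea as I understand it from the docs: the state remembers the transaction id of the last batch it applied and skips a replay of that same batch. The class and method names are hypothetical and not part of the Storm API, and the sketch is only correct if batches arrive in order, which is exactly what I am asking about.

```java
// Hypothetical illustration only; not Storm/Trident code.
// Assumes a transactional spout: a replayed batch has the same txid
// and the same contents as the original attempt.
final class TxidCounterSketch {
    private long count = 0;
    private long lastTxid = -1; // -1 means no batch has been applied yet

    /**
     * Applies the partial count of one batch.
     * Returns false if the batch is a replay of the latest applied batch.
     */
    boolean applyBatch(long txid, long batchCount) {
        if (txid == lastTxid) {
            return false; // same batch replayed: update was already applied
        }
        // This only stays correct if no batch older than lastTxid
        // can ever show up again, i.e. if state updates are ordered.
        count += batchCount;
        lastTxid = txid;
        return true;
    }

    long value() {
        return count;
    }
}
```

If a bolt outside partitionPersist could see, say, batch 3 replayed after it has already processed batch 4, a check like this would double-count.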

I couldn't find these questions covered in the docs or in previous discussions. 
So I tried consulting the source code, but it is hard to follow on these 
points.
Any help would be highly appreciated.

Best regards,
Felix

[1] https://storm.apache.org/documentation/Trident-spouts.html#pipelining
