Felix,
Trident tracks batches using the ackers and coordinators, so as soon as a
trident spout outputs a tuple, it will start flowing through the topology.
When that batch completes fully the coordinator sends out another set of
messages to commit the results from that batch. When that completes, the next
batch is output.
The guarantee is that the batches will be committed in order, possibly with a
repeat. I have not dug into the details of how all of this works with the
state guarantees though.
- Bobby
On Monday, February 8, 2016 10:20 AM, Felix Dreissig <[email protected]> wrote:
Hi,
I’m re-posting this from Storm-User, as it didn’t get a reply there and touches
the internal implementation quite a bit.
I am trying to understand the parallelism properties and transactional
semantics offered by Trident and couldn’t find an answer to these two questions:
1. The „Trident Spouts“ documentation [1] says that „[b]y default, Trident
processes a single batch at a time, waiting for the batch to succeed or fail
before trying another batch“.
But do Trident bolts always wait until a batch is completed, collect the
results and then pass them on to the next bolt(s) as complete batches? Without
pipelining, this would mean that only one bolt can be active at a time,
effectively preventing any parallelism.
Or are tuples entering a stream and being delivered to the next bolt(s) as soon
as they are emitted? This would still introduce some idle time and increased
latency without pipelining, but at least seems like a better resource
utilization.
2. The idea that states will only ever have to deal with a new batch or one
from immediately before (assuming transactional or opaque-transactional spouts)
is at the core of Trident’s state model.
On this topic, the docs section from above promises that even with pipelining,
„Trident will order any state updates taking place in the topology among
batches“. Is this some special guarantee for the built-in stateful operations
(i.e. partitionPersist and persistentAggregate, which afaics uses
partitionPersist internally), or can all bolts assume that they’ll never see
any batch repeated except the latest one they processed?
I couldn’t find these questions covered in the docs or in previous discussions.
So I tried consulting the source code, but it’s not easily comprehensible with
regard to such issues.
Any help would be highly appreciated.
Best regards,
Felix
[1] https://storm.apache.org/documentation/Trident-spouts.html#pipelining