Hi,

We have been using Storm for more than a year now and are generally very happy with the product. However, I want to raise a concern about the Trident documentation. A few colleagues and I went over the documentation at http://storm.apache.org/releases/current/Trident-state.html, but it is still unclear to us how micro-batching achieves its goal.
Consider this statement, for instance: "*Suppose your topology computes word count and you want to store the word counts in a key/value database ... what you can do is store the transaction id with the count in the database as an atomic value.*"

*Question:* Is the key/value database the target of the Trident topology, or something internal to Trident?

- If it is the target of the topology, how can we change the system from a key-value store to a key-value-txid store? It seems we are making the target itself idempotent rather than providing exactly-once semantics within Trident in a generic way; the target's idempotency is specific to that target and should not be counted as an exactly-once property of the streaming layer.
- If the key-value store is internal to Trident, then how do we achieve exactly-once for the actual target? The actual target and the key-value store are two different systems with no distributed transaction between them. Clearly, one system can fail while the other succeeds, thus losing the exactly-once promise.

*Another question:* It seems that a micro-batch will be replayed continuously even if just one word in that batch failed to update its count in the key-value store. If that is true, the next batch will not even be sent to the target system, because that next batch might contain the same word, and since the previous batch did not succeed due to that failing word, the next batch would fail as well (Rule 3: "*State updates are ordered among batches. That is, the state updates for batch 3 won't be applied until the state updates for batch 2 have succeeded.*"). This means the whole topology will be stuck at the failing batch. That is much worse than single-tuple at-least-once delivery, because a batch may contain thousands to millions of messages, all of which are now stuck, bringing the topology to a standstill.
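To make our reading of the quoted txid scheme concrete, here is a minimal sketch of the idempotent update we believe the docs are describing. All names here are hypothetical (this is not Storm's actual API), and the "database" is just an in-memory map standing in for the external key/value store, where (count, txid) would be written atomically:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the txid-based idempotent update described in the Trident docs.
// Each stored value carries the count plus the txid of the batch that
// produced it, written as one atomic unit.
public class TxidWordCountStore {
    static final class CountWithTxid {
        final long count;
        final long txid;
        CountWithTxid(long count, long txid) { this.count = count; this.txid = txid; }
    }

    private final Map<String, CountWithTxid> db = new HashMap<>();

    // Apply one batch's partial count for a word. If this txid was already
    // applied (i.e., the batch is a replay), skip the update so the count
    // is not double-added.
    public void applyUpdate(String word, long partialCount, long txid) {
        CountWithTxid current = db.get(word);
        if (current != null && current.txid == txid) {
            return; // replayed batch: update already applied, do nothing
        }
        long base = (current == null) ? 0 : current.count;
        db.put(word, new CountWithTxid(base + partialCount, txid));
    }

    public long getCount(String word) {
        CountWithTxid c = db.get(word);
        return (c == null) ? 0 : c.count;
    }

    public static void main(String[] args) {
        TxidWordCountStore store = new TxidWordCountStore();
        store.applyUpdate("apple", 3, 1); // batch 1
        store.applyUpdate("apple", 3, 1); // replay of batch 1: ignored
        store.applyUpdate("apple", 2, 2); // batch 2
        System.out.println(store.getCount("apple")); // prints 5, not 8
    }
}
```

Note that skipping on an equal txid is only safe if a replayed batch with the same txid always contains exactly the same tuples, which, as we understand it, is precisely what a transactional spout is supposed to guarantee. This sketch is what makes us suspect the idempotency lives in the target store, not in Trident itself.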
Even if the batch failure is only temporary, exactly-once delivery will take a large latency hit compared to its at-least-once sibling. If this is true, that fact should be added to the documentation so that users know beforehand what they are signing up for.

*A side note:* This statement seems to be very old and is no longer true: "*One side note – once Kafka supports replication, it will be possible to have transactional spouts that are fault-tolerant to node failure, but that feature does not exist yet.*" Kafka does support replication, and we should consider fixing the docs to reflect that.

Thanks,
TI
