Hi,

We have been using Storm for more than a year now and are generally very happy with the product. However, I want to raise a concern about the Trident documentation. A few colleagues and I went over the documentation at http://storm.apache.org/releases/current/Trident-state.html, but it is still unclear to us how micro-batching achieves its goal.
Consider this statement, for instance: "*Suppose your topology computes word count and you want to store the word counts in a key/value database ... what you can do is store the transaction id with the count in the database as an atomic value.*"

*Question:* Is the key/value database the target of the Trident topology, or something internal to Trident?

- If it is the target of the topology, how can we change the system from a key-value store to a key-value-txid store? It seems we are making the target itself idempotent rather than providing exactly-once semantics within Trident in a generic way; the target's idempotency is specific to that target and should not be counted as an exactly-once property of the streaming layer.
- If the key-value store is internal to Trident, then how do we achieve exactly-once for the actual target? The actual target and the key-value store are two different systems with no distributed transaction between them. Clearly, one system can fail while the other succeeds, thus losing the exactly-once promise.

*Another question:* It seems that a micro-batch will be replayed continuously even if just one word in that batch failed to update its count in the key-value store. If that is true, the next batch will not even be sent to the target system, because that next batch might contain the same word, and since the previous batch did not succeed due to that failing word, the next batch would fail as well (Rule 3: "*State updates are ordered among batches. That is, the state updates for batch 3 won't be applied until the state updates for batch 2 have succeeded.*"). This means the whole topology will be stuck at the failing batch. That is much worse than single-tuple at-least-once delivery, because a batch may contain thousands to millions of messages, all of which are now stuck, bringing the topology to a standstill.
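To make our reading of the quoted txid scheme concrete, here is a minimal sketch of the idempotent update we believe the docs are describing. All names here are hypothetical (this is not Storm's actual API), and the "database" is just an in-memory map standing in for the external key/value store, where (count, txid) would be written atomically:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the txid-based idempotent update described in the Trident docs.
// Each stored value carries the count plus the txid of the batch that
// produced it, written as one atomic unit.
public class TxidWordCountStore {
    static final class CountWithTxid {
        final long count;
        final long txid;
        CountWithTxid(long count, long txid) { this.count = count; this.txid = txid; }
    }

    private final Map<String, CountWithTxid> db = new HashMap<>();

    // Apply one batch's partial count for a word. If this txid was already
    // applied (i.e., the batch is a replay), skip the update so the count
    // is not double-added.
    public void applyUpdate(String word, long partialCount, long txid) {
        CountWithTxid current = db.get(word);
        if (current != null && current.txid == txid) {
            return; // replayed batch: update already applied, do nothing
        }
        long base = (current == null) ? 0 : current.count;
        db.put(word, new CountWithTxid(base + partialCount, txid));
    }

    public long getCount(String word) {
        CountWithTxid c = db.get(word);
        return (c == null) ? 0 : c.count;
    }

    public static void main(String[] args) {
        TxidWordCountStore store = new TxidWordCountStore();
        store.applyUpdate("apple", 3, 1); // batch 1
        store.applyUpdate("apple", 3, 1); // replay of batch 1: ignored
        store.applyUpdate("apple", 2, 2); // batch 2
        System.out.println(store.getCount("apple")); // prints 5, not 8
    }
}
```

Note that skipping on an equal txid is only safe if a replayed batch with the same txid always contains exactly the same tuples, which, as we understand it, is precisely what a transactional spout is supposed to guarantee. This sketch is what makes us suspect the idempotency lives in the target store, not in Trident itself.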
Even if the batch failure is only temporary, exactly-once delivery will take a large latency hit compared to its at-least-once sibling. If this is true, that fact should be added to the documentation so that users know beforehand what they are signing up for.

*A side note:* This statement seems to be very old and is no longer true: "*One side note – once Kafka supports replication, it will be possible to have transactional spouts that are fault-tolerant to node failure, but that feature does not exist yet.*" Kafka does support replication, and we should consider fixing the docs to reflect that.

Thanks,
TI
