[
https://issues.apache.org/jira/browse/STORM-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14716323#comment-14716323
]
Matthias J. Sax commented on STORM-855:
---------------------------------------
I have never worked with Trident myself, but as far as I understand it, its
micro-batching breaks tuple-by-tuple processing semantics: a batch of tuples
is assembled at the source and piped through the topology, and the batch stays
a batch the whole time.
This is quite different from my approach: tuples are only batched "under the
hood", and tuple-by-tuple processing semantics are preserved. A batch is
assembled at the output of an operator and "de-assembled" at the consumer; the
consumer is not required to batch its own output in turn. Batching is thus
introduced on a per-operator basis, i.e., output batching can be enabled and
disabled independently for each operator (which also allows different batch
sizes for different operators, and even different batch sizes for different
output streams of the same operator). Since the batch size can be tuned at
this fine granularity, latency need not increase as much. Additionally, if a
single tuple fails, only that single tuple needs to be replayed (not the whole
batch, as in Trident).
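To make this concrete, here is a minimal, self-contained sketch of the idea
(plain Java with made-up names like OutputBatcher, not Storm's actual
internals): the producer buffers outgoing tuples and ships them downstream as
one fat tuple once the configured batch size is reached, while the consumer
unpacks the batch and runs the regular per-tuple processing path.
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Producer-side batching: collect outgoing tuples and ship them as one
// "fat tuple" once the configured batch size is reached.
final class OutputBatcher<T> {
    private final int batchSize;               // configured per operator/stream
    private final List<T> buffer;
    private final Consumer<List<T>> transport; // ships one fat tuple downstream

    OutputBatcher(int batchSize, Consumer<List<T>> transport) {
        this.batchSize = batchSize;
        this.buffer = new ArrayList<>(batchSize);
        this.transport = transport;
    }

    // Called wherever the operator would normally emit a single tuple.
    void emit(T tuple) {
        buffer.add(tuple);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Also called on shutdown (or a timeout) so tuples never get stuck.
    void flush() {
        if (!buffer.isEmpty()) {
            transport.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    // Consumer side: de-assemble the fat tuple and hand each contained
    // tuple to the regular execute() path, one by one.
    static <T> void onBatchReceived(List<T> batch, Consumer<T> execute) {
        batch.forEach(execute);
    }
}
{code}
One batcher instance per output stream would give the independent per-stream
batch sizes described above.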
Last but not least, [~revans2] encouraged me to contribute this feature.
Please see here:
https://mail-archives.apache.org/mod_mbox/storm-dev/201505.mbox/%3C55672973.9040809%40informatik.hu-berlin.de%3E
> Add tuple batching
> ------------------
>
> Key: STORM-855
> URL: https://issues.apache.org/jira/browse/STORM-855
> Project: Apache Storm
> Issue Type: New Feature
> Reporter: Matthias J. Sax
> Assignee: Matthias J. Sax
> Priority: Minor
>
> In order to increase Storm's throughput, multiple tuples can be grouped
> together into a batch of tuples (i.e., a fat tuple) and transferred from
> producer to consumer at once.
> The initial idea is taken from https://github.com/mjsax/aeolus. However, we
> aim to integrate this feature deep into the system (in contrast to building
> it on top), which has multiple advantages:
> - batching can be even more transparent to the user (e.g., no extra
> direct-streams are needed to mimic Storm's data distribution patterns)
> - fault tolerance (anchoring/acking) can be done at tuple granularity
> (rather than at batch granularity, which leads to many more replayed tuples
> -- and duplicate results -- in case of failure)
> The aim is to extend the TopologyBuilder interface with an additional
> parameter 'batch_size' to expose this feature to the user. By default,
> batching will be disabled.
> This batching feature serves purely as a tuple-transport mechanism, i.e.,
> tuple-by-tuple processing semantics are preserved. An output batch is
> assembled at the producer and completely disassembled at the consumer. The
> consumer's own output can be batched again, independently of whether its
> input was batched. Thus, batches can have a different size for each
> producer-consumer pair. Furthermore, consumers can receive batches of
> different sizes from different producers (including regular, non-batched
> input).
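For illustration, a rough sketch of how the proposed 'batch_size' parameter
might look from the user's side. The extra argument is hypothetical (it is
what this issue proposes to add), and SentenceSpout/SplitBolt are placeholder
classes; only the plain TopologyBuilder calls exist in today's API.
{code:java}
import backtype.storm.topology.TopologyBuilder;

public class BatchingApiSketch {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // Today's API (SentenceSpout/SplitBolt are placeholders):
        // builder.setSpout("sentences", new SentenceSpout(), 2);
        // builder.setBolt("split", new SplitBolt(), 4)
        //        .shuffleGrouping("sentences");

        // Proposed (hypothetical) extension: an additional batch_size
        // parameter per operator, with batching disabled by default:
        // builder.setSpout("sentences", new SentenceSpout(), 2,
        //                  /* batch_size */ 100);
        // builder.setBolt("split", new SplitBolt(), 4, /* batch_size */ 50)
        //        .shuffleGrouping("sentences");
    }
}
{code}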
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)