[
https://issues.apache.org/jira/browse/STORM-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711551#comment-14711551
]
ASF GitHub Bot commented on STORM-855:
--------------------------------------
Github user mjsax commented on the pull request:
https://github.com/apache/storm/pull/694#issuecomment-134656876
I just checked some older benchmark results for batching done in user land, i.e.,
on top of Storm (=> Aeolus). For that case, a batch size of 100 increased the
spout output rate by a factor of 6 (instead of the factor of 1.5 the benchmark
above shows). The benchmark should yield more than 70M tuples per 30 seconds...
(and not about 19M).
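(As a rough sanity check, assuming the ~19M figure corresponds to the observed
1.5x speedup: the non-batched baseline would be about 19M / 1.5 ≈ 12.7M tuples
per 30 seconds, and a 6x speedup over that baseline gives roughly 76M, which
matches the "more than 70M" expectation.)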
Of course, batching is done a little differently now. In Aeolus, a fat-tuple
is used as the batch, so the system sees only a single batch-tuple. Now, the
system sees all tuples, but emitting is delayed until the batch is full (this
still saves the overhead of going through the disruptor for each tuple).
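To make the delayed-emit idea concrete, here is a minimal sketch (not the
actual patch; all names are illustrative): tuples are buffered and handed
downstream in one call once the batch is full, so the queue boundary is
crossed once per batch instead of once per tuple.
```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative sketch of delayed emitting, not Storm's actual code.
final class BatchingEmitter<T> {
    private final int batchSize;
    private final Consumer<List<T>> transfer; // e.g. a disruptor publish
    private List<T> buffer;

    BatchingEmitter(int batchSize, Consumer<List<T>> transfer) {
        this.batchSize = batchSize;
        this.transfer = transfer;
        this.buffer = new ArrayList<>(batchSize);
    }

    // Called for every logical emit; the tuple only crosses the
    // queue boundary once the buffer reaches the batch size.
    void emit(T tuple) {
        buffer.add(tuple);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // In practice a flush would also be triggered by a timeout,
    // so partially filled batches do not add unbounded latency.
    void flush() {
        if (!buffer.isEmpty()) {
            transfer.accept(buffer);
            buffer = new ArrayList<>(batchSize);
        }
    }
}
```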
However, we generate a tuple-ID for each tuple in the batch, instead of a
single ID per batch. I am not sure how expensive this is. Because acking was
not enabled, it should not be too expensive, since the IDs do not have to be
"registered" with the ackers (right?).
As a further optimization, it might be a good idea not to batch whole
tuples, but only the `Values` and tuple-ID. The `worker-context`, `task-id`, and
`outstream-id` are the same for all tuples within a batch. I will try this out,
and push a new version in the next few days if it works.
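A possible layout for that optimization, purely as a sketch (field names are
hypothetical): the per-batch metadata is stored once, and only the per-tuple
`Values` and tuple-ID are replicated.
```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical batch layout: shared metadata once per batch,
// payload and ID once per tuple.
final class TupleBatch {
    final int sourceTaskId;   // identical for all tuples in the batch
    final String outStreamId; // identical for all tuples in the batch

    final List<List<Object>> values = new ArrayList<>(); // per-tuple Values
    final List<Long> tupleIds = new ArrayList<>();       // per-tuple ID

    TupleBatch(int sourceTaskId, String outStreamId) {
        this.sourceTaskId = sourceTaskId;
        this.outStreamId = outStreamId;
    }

    void add(List<Object> tupleValues, long tupleId) {
        values.add(tupleValues);
        tupleIds.add(tupleId);
    }
}
```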
> Add tuple batching
> ------------------
>
> Key: STORM-855
> URL: https://issues.apache.org/jira/browse/STORM-855
> Project: Apache Storm
> Issue Type: New Feature
> Reporter: Matthias J. Sax
> Assignee: Matthias J. Sax
> Priority: Minor
>
> In order to increase Storm's throughput, multiple tuples can be grouped
> together in a batch of tuples (i.e., a fat-tuple) and transferred from producer
> to consumer at once.
> The initial idea is taken from https://github.com/mjsax/aeolus. However, we
> aim to integrate this feature deep into the system (in contrast to building
> it on top), which has multiple advantages:
> - batching can be even more transparent to the user (e.g., no extra
> direct-streams needed to mimic Storm's data distribution patterns)
> - fault-tolerance (anchoring/acking) can be done at tuple granularity
> (not at batch granularity, which would lead to many more replayed tuples --
> and duplicate results -- in case of failure)
> The aim is to extend the TopologyBuilder interface with an additional
> parameter 'batch_size' to expose this feature to the user (a usage sketch
> follows this description). By default, batching will be disabled.
> This batching feature serves a pure tuple-transport purpose, i.e.,
> tuple-by-tuple processing semantics are preserved. An output batch is
> assembled at the producer and completely disassembled at the consumer. The
> consumer's output can be batched again, independently of whether its input
> was batched or not. Thus, batches can be of a different size for each
> producer-consumer pair. Furthermore, consumers can receive batches of
> different sizes from different producers (including regular non-batched
> input).
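For illustration only: the ticket specifies just an additional 'batch_size'
parameter with batching off by default, so the overload and the component
classes below are hypothetical. The extended builder might be used along
these lines:
```java
import backtype.storm.topology.TopologyBuilder;

// Hypothetical usage sketch; the batch-size overload does not exist
// in current Storm, and MySpout/MyBolt stand in for real components.
TopologyBuilder builder = new TopologyBuilder();

// Unchanged call => batching disabled (the proposed default).
builder.setSpout("source", new MySpout(), 2);

// Possible extension: an extra batch_size parameter per component,
// e.g. buffer 100 output tuples before they cross the queue:
// builder.setBolt("worker", new MyBolt(), 4, /* batch_size */ 100)
//        .shuffleGrouping("source");
builder.setBolt("worker", new MyBolt(), 4)
       .shuffleGrouping("source");
```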