[ https://issues.apache.org/jira/browse/STORM-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959782#comment-14959782 ]
ASF GitHub Bot commented on STORM-855:
--------------------------------------
Github user revans2 commented on the pull request:
https://github.com/apache/storm/pull/765#issuecomment-148544657
@mjsax
What I saw when testing STORM-855 was that the maximum throughput was cut
almost in half, from 10,000 sentences per second to 5,500. But your numbers
showed maximum throughput more than doubling, from around 7,960,300 tuples sent
in 30 seconds to 16,347,100 in the same time period (no acking), and from
1,832,160 in 30 seconds to 2,323,580, an increase of about 25%, with acking.
To me this feels like a contradiction. The only explanation I can think of is
that the messaging layer is so slow that cutting a worker's maximum throughput
in half has no impact on overall performance, provided the extra batching can
double the throughput of the messaging layer.
This is likely the case: on the high end, 16,347,100 tuples / 30 seconds / 24
workers is about 22,700 tuples per second per worker, whereas 5,500 sentences
per second results in about 181,500 total tuples per second per worker being
processed.
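For reference, a quick back-of-envelope check of the arithmetic above (a
minimal sketch only; the 24-worker count and the throughput figures are taken
from the numbers quoted in this thread, and the tuples-per-sentence ratio is
simply derived from them):

    // Rough sanity check of the per-worker throughput numbers quoted above.
    public class ThroughputCheck {
        public static void main(String[] args) {
            // No-acking run: 16,347,100 tuples in 30 seconds across 24 workers
            double perWorker = 16_347_100.0 / 30 / 24;     // ~22,700 tuples/s per worker
            // STORM-855 test: 5,500 sentences/s vs. ~181,500 total tuples/s per worker
            double tuplesPerSentence = 181_500.0 / 5_500;  // ~33 tuples per input sentence
            System.out.printf("no-ack throughput: %,.0f tuples/s per worker%n", perWorker);
            System.out.printf("tuples per sentence: %.0f%n", tuplesPerSentence);
        }
    }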
I'm just looking for feedback from others on this, but it looks like I need
to do a distributed apples-to-apples comparison as well to see the impact the
messaging layer has.
> Add tuple batching
> ------------------
>
> Key: STORM-855
> URL: https://issues.apache.org/jira/browse/STORM-855
> Project: Apache Storm
> Issue Type: New Feature
> Components: storm-core
> Reporter: Matthias J. Sax
> Assignee: Matthias J. Sax
> Priority: Minor
>
> In order to increase Storm's throughput, multiple tuples can be grouped
> together into a batch of tuples (i.e., a fat tuple) and transferred from the
> producer to the consumer at once.
> The initial idea is taken from https://github.com/mjsax/aeolus. However, we
> aim to integrate this feature deep into the system (in contrast to building
> it on top), which has multiple advantages:
> - batching can be even more transparent to the user (e.g., no extra
> direct-streams needed to mimic Storm's data distribution patterns)
> - fault-tolerance (anchoring/acking) can be done at tuple granularity
> (rather than at batch granularity, which leads to many more replayed tuples
> -- and result duplicates -- in case of failure)
> The aim is to extend the TopologyBuilder interface with an additional
> parameter 'batch_size' to expose this feature to the user (see the sketch
> after this description). By default, batching will be disabled.
> This batching feature serves a pure tuple-transport purpose, i.e.,
> tuple-by-tuple processing semantics are preserved. An output batch is
> assembled at the producer and completely disassembled at the consumer. The
> consumer's output can be batched again, independently of whether its input
> was batched or not. Thus, batch sizes can differ for each producer-consumer
> pair. Furthermore, consumers can receive batches of different sizes from
> different producers (including regular non-batched input).
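To make the proposal concrete, here is a minimal usage sketch. Only the plain
TopologyBuilder/setSpout/setBolt/grouping calls are existing Storm API; the
batch_size argument shown in comments is an assumption based on the parameter
named in the description, and SentenceSpout/SplitBolt/CountBolt are placeholder
component classes, not real Storm classes:

    // Hypothetical sketch (Storm 0.10-era API). The batch_size argument shown
    // in comments is the proposed extension and does NOT exist in Storm; the
    // spout and bolt classes are placeholders.
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    public class BatchingSketch {
        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("sentences", new SentenceSpout(), 4);
            // Proposed: setBolt("split", new SplitBolt(), 8, /* batch_size */ 100)
            // A batch would be assembled at the producer and completely
            // disassembled at the consumer, so anchoring/acking stays per tuple.
            builder.setBolt("split", new SplitBolt(), 8)
                   .shuffleGrouping("sentences");
            // Each producer-consumer pair could use its own batch size; batching
            // stays disabled when no batch size is given.
            builder.setBolt("count", new CountBolt(), 8)
                   .fieldsGrouping("split", new Fields("word"));
        }
    }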
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)