[ https://issues.apache.org/jira/browse/STORM-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967622#comment-14967622 ]
ASF GitHub Bot commented on STORM-855:
--------------------------------------
Github user revans2 commented on the pull request:
https://github.com/apache/storm/pull/765#issuecomment-149987537
So I have been running a number of tests trying to come to a conclusive
decision on how Storm should handle batching, and trying to understand the
difference between my test results and the test results from #694. I ran the
word count test I wrote as part of #805 on a 35-node Storm cluster. This was
done against several different Storm versions: the baseline from the #805 pull
request; this patch + #805 (batch-v2); and #694 + #805 + modifications to use
the hybrid approach so that acking and batching work in a multi-process
topology (STORM-855). To keep the numbers from becoming hard to parse I am
just going to include some charts, but if anyone wants to see the raw numbers
or reproduce the tests themselves I am happy to provide data and/or branches.
The numbers below were collected after the topology had been running for at
least 200 seconds, to avoid startup effects like JIT warm-up. I filtered
out any 30-second interval where the measured throughput was not within +/- 10%
of the target throughput, on the assumption that if the topology could not keep
up with the desired throughput, or was trying to catch up from previous
slowness, it would not be within that range. I did not filter based on the
number of failures, simply because that would have removed all of the
STORM-855 results with batching enabled. None of the other test
configurations saw any failures at all during testing.
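To make the filtering concrete, here is a minimal sketch of the kind of check I applied to each 30-second sample (the names and the sample format are illustrative, not the actual test harness code):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: keep a 30-second measurement window only if its
// measured throughput landed within +/- 10% of the target throughput.
public class SampleFilter {
    public static List<Double> keepStableWindows(List<Double> measuredThroughputs, double target) {
        List<Double> kept = new ArrayList<>();
        for (double measured : measuredThroughputs) {
            if (measured >= target * 0.9 && measured <= target * 1.1) {
                kept.add(measured);
            }
        }
        return kept;
    }
}
```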

This chart shows the 99th-percentile latency vs. measured throughput. It is not
too interesting, except to note that batching in STORM-855 at low throughput
resulted in nothing being fully processed; all of the tuples timed out before
they could finish. Only at a medium throughput, above 16,000 sentences/second,
were there enough tuples flowing to complete batches regularly, but even
then many tuples would still time out. This should be fixable with a
batch timeout, but that is not implemented yet.
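For what I mean by a batch timeout, here is a rough sketch (hypothetical names, not code from this patch): flush a batch either when it is full or when its oldest tuple has been waiting longer than a configured limit, so low-rate streams still make progress.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: flush when the batch is full OR when the first tuple
// in it has waited longer than timeoutMs, so tuples on low-rate streams do
// not sit in a half-full batch until they time out.
class TimedBatch<T> {
    private final int maxSize;
    private final long timeoutMs;
    private final List<T> buffer = new ArrayList<>();
    private long firstAddedAt = -1;

    TimedBatch(int maxSize, long timeoutMs) {
        this.maxSize = maxSize;
        this.timeoutMs = timeoutMs;
    }

    // Returns the flushed batch, or null if the batch is not ready yet.
    synchronized List<T> add(T tuple) {
        if (buffer.isEmpty()) {
            firstAddedAt = System.currentTimeMillis();
        }
        buffer.add(tuple);
        boolean full = buffer.size() >= maxSize;
        boolean expired = System.currentTimeMillis() - firstAddedAt >= timeoutMs;
        if (full || expired) {
            List<T> out = new ArrayList<>(buffer);
            buffer.clear();
            firstAddedAt = -1;
            return out;
        }
        return null;
    }
}
```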
To get a better view I put the latency axis on a log scale.

From this we can see that on the very low end batching-v2 increases the
99th-percentile latency from 5-10 ms to 19-21 ms. Most of that can be recovered
by configuring a batch size of 1 instead of the default 100 tuples. However,
once the baseline stops functioning at around 7,000 sentences/sec, the batching
code is able to keep working with either a batch size of 1 or 100. I
believe this has to do with the automatic backpressure: in the baseline
code backpressure does not take the overflow buffer into account, but in the
batching code it does. I think this gives the topology more stability in
maintaining a throughput, but I don't have any solid evidence for that.
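To make the backpressure difference concrete, this is roughly the distinction I am describing (field and method names invented for illustration; this is not the actual backpressure code):

```java
// Illustrative sketch only: the baseline triggers throttling from the main
// receive queue population alone, while the batching code also counts tuples
// parked in the overflow buffer, so it reacts earlier and more consistently.
class BackpressureCheck {
    long queuePopulation;  // tuples in the main receive queue (hypothetical)
    long overflowCount;    // tuples sitting in the overflow buffer (hypothetical)
    long highWaterMark;    // threshold at which throttling kicks in

    boolean baselineShouldThrottle() {
        return queuePopulation > highWaterMark;
    }

    boolean batchingShouldThrottle() {
        return queuePopulation + overflowCount > highWaterMark;
    }
}
```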
I then zoomed in on the graphs to show what a 2 second SLA would look like

and a 100 ms SLA.

In both cases the batching v2 with a batch size of 100 was able to handle
the highest throughput for that given latency.
Then I wanted to look at memory and CPU Utilization.

Memory does not show much. The amount of memory used varies a bit from one
configuration to another, but keep in mind this is spread across 35 worker
processes, so it works out to roughly 70 MB/worker up to about 200 MB/worker.
The numbers simply show that as the throughput increases the memory utilization
does too, and it does not vary much from one implementation to another.

CPU, however, shows that on the low end we go from 7 or 8 cores worth
of CPU time to about 35 cores worth for the batching code. This seems to be
the result of the batch flushing threads waking up periodically. We should be
able to mitigate this by making that interval larger, but that would
in turn impact the latency.
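The cost pattern is what you would expect from a periodic flusher along these lines (a sketch of the idea, not the actual implementation): every executor wakes up on a fixed interval whether or not there is anything to flush, so a shorter interval buys latency at the cost of idle CPU across the whole cluster.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of a periodic batch flusher: the wake-up interval trades CPU for
// latency. Waking up every millisecond in many executors adds up to a lot of
// CPU time at low throughput; a longer interval adds latency instead.
class PeriodicFlusher {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    void start(Runnable flushPendingBatches, long intervalMs) {
        scheduler.scheduleAtFixedRate(flushPendingBatches, intervalMs, intervalMs,
                                      TimeUnit.MILLISECONDS);
    }

    void stop() {
        scheduler.shutdownNow();
    }
}
```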
I believe that with further work we should be able to reduce that CPU
utilization and the low-end latency by dynamically adjusting the batch
size and timeout based on a specified SLA. At this point I feel this
branch is ready for a formal review and inclusion into Storm; the drawbacks of
this patch do not seem to outweigh its clear advantages. Additionally,
given the stability problems associated with #694, I cannot feel good about
recommending it at this time. It is clear that some of what it does is
worthwhile, though, and I think we should explore the alternative batch
serialization between worker processes as a potential standalone piece.
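Sketching the dynamic adjustment mentioned above (nothing in this branch does this yet; the names and thresholds are made up): grow the batch while measured latency leaves headroom under the SLA, and shrink it as latency approaches the SLA.

```java
// Hypothetical sketch of SLA-driven tuning: grow the batch while the observed
// 99th-percentile latency is well under the SLA, shrink it when latency gets
// close, so low-throughput topologies fall back toward a batch size of 1.
class AdaptiveBatchSizer {
    private final long slaMs;
    private int batchSize;

    AdaptiveBatchSizer(long slaMs, int initialBatchSize) {
        this.slaMs = slaMs;
        this.batchSize = initialBatchSize;
    }

    int adjust(long observedP99Ms) {
        if (observedP99Ms > slaMs * 0.8 && batchSize > 1) {
            batchSize = Math.max(1, batchSize / 2);     // back off to protect latency
        } else if (observedP99Ms < slaMs * 0.5 && batchSize < 1000) {
            batchSize = Math.min(1000, batchSize * 2);  // push for more throughput
        }
        return batchSize;
    }
}
```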
@d2r @knusbaum @kishorvpatil @harshach @HeartSaVioR @ptgoetz
If you could please review this, keeping in mind that it is based on #797, I
would like to try to get this in soon.
> Add tuple batching
> ------------------
>
> Key: STORM-855
> URL: https://issues.apache.org/jira/browse/STORM-855
> Project: Apache Storm
> Issue Type: New Feature
> Components: storm-core
> Reporter: Matthias J. Sax
> Assignee: Matthias J. Sax
> Priority: Minor
>
> In order to increase Storm's throughput, multiple tuples can be grouped
> together into a batch of tuples (i.e., a fat tuple) and transferred from
> producer to consumer at once.
> The initial idea is taken from https://github.com/mjsax/aeolus. However, we
> aim to integrate this feature deep into the system (in contrast to building
> it on top), which has multiple advantages:
> - batching can be even more transparent to the user (e.g., no extra
> direct streams needed to mimic Storm's data distribution patterns)
> - fault tolerance (anchoring/acking) can be done at tuple granularity
> (not at batch granularity, which leads to many more replayed tuples -- and
> result duplicates -- in case of failure)
> The aim is to extend the TopologyBuilder interface with an additional
> parameter 'batch_size' to expose this feature to the user. By default,
> batching will be disabled.
> This batching feature serves a pure tuple-transport purpose, i.e.,
> tuple-by-tuple processing semantics are preserved. An output batch is
> assembled at the producer and completely disassembled at the consumer. The
> consumer output can be batched again, independent of whether the input was
> batched or not. Thus, batches can be of different sizes for each
> producer-consumer pair. Furthermore, consumers can receive batches of
> different sizes from different producers (including regular non-batched
> input).
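As a rough illustration of the transport-only batching described above (invented names, not code from the patch): tuples are assembled into a fat tuple at the producer and unpacked individually at the consumer, so anchoring and acking stay at tuple granularity.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: a "fat tuple" carries many logical tuples across the
// wire at once. The producer appends tuples until the batch is sent; the
// consumer unpacks and processes them one by one, preserving tuple-by-tuple
// processing semantics.
class FatTuple<T> {
    private final List<T> tuples = new ArrayList<>();

    void add(T tuple) {       // producer side: assemble the batch
        tuples.add(tuple);
    }

    List<T> unpack() {        // consumer side: disassemble into single tuples
        return new ArrayList<>(tuples);
    }

    int size() {
        return tuples.size();
    }
}
```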
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)