[ https://issues.apache.org/jira/browse/STORM-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959782#comment-14959782 ]

ASF GitHub Bot commented on STORM-855:
--------------------------------------

Github user revans2 commented on the pull request:

    https://github.com/apache/storm/pull/765#issuecomment-148544657
  
    @mjsax 
    
    What I saw when testing STORM-855 was that the maximum throughput was cut 
almost in half, from 10,000 sentences per second to 5,500.  But your numbers 
showed maximum throughput more than doubling, from around 7,960,300 tuples sent 
in 30 seconds to 16,347,100 in the same time period (no acking), and, with 
acking, rising from 1,832,160 in 30 seconds to 2,323,580, an increase of about 
27%.
    
    To me this feels like a contradiction. The only explanation I can think of 
is that the messaging layer is so slow that cutting the maximum throughput of a 
worker in half has no impact on overall performance, so long as the extra 
batching can double the throughput of the messaging layer.
    
    This is likely the case: on the high end, 16,347,100 tuples / 30 seconds / 
24 workers is about 22,000 tuples per second per worker, whereas 5,500 
sentences per second works out to about 181,500 total tuples per second per 
worker being processed.
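    
    A quick back-of-envelope check of those per-worker rates (a sketch only, 
assuming the 24 workers and 30-second window quoted above; the ~33 tuples per 
sentence is simply what the quoted totals imply, 181,500 / 5,500, not a 
separately measured number):
    
        // Back-of-envelope check of the rates discussed above.
        // Assumptions: 24 workers, 30-second window, totals as quoted in this thread.
        public class ThroughputCheck {
            public static void main(String[] args) {
                double tuplesSent = 16_347_100;            // no-acking total from the test
                double seconds = 30;
                double workers = 24;
                System.out.printf("messaging layer: ~%.0f tuples/sec/worker%n",
                                  tuplesSent / seconds / workers);           // ~22,700
    
                double sentencesPerSec = 5_500;            // max rate seen with STORM-855 applied
                double tuplesPerSentence = 181_500.0 / 5_500.0;  // ~33, implied by the quoted totals
                System.out.printf("worker processing: ~%.0f tuples/sec/worker%n",
                                  sentencesPerSec * tuplesPerSentence);      // ~181,500
            }
        }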
    
    I'm just looking for feedback from others on this, but it looks like I also 
need to do a distributed apples-to-apples comparison to see what impact the 
messaging layer has.


> Add tuple batching
> ------------------
>
>                 Key: STORM-855
>                 URL: https://issues.apache.org/jira/browse/STORM-855
>             Project: Apache Storm
>          Issue Type: New Feature
>          Components: storm-core
>            Reporter: Matthias J. Sax
>            Assignee: Matthias J. Sax
>            Priority: Minor
>
> In order to increase Storm's throughput, multiple tuples can be grouped 
> together into a batch of tuples (i.e., a fat tuple) and transferred from 
> producer to consumer at once.
> The initial idea is taken from https://github.com/mjsax/aeolus. However, we 
> aim to integrate this feature deep into the system (in contrast to building 
> it on top), which has multiple advantages:
>   - batching can be even more transparent to the user (e.g., no extra 
> direct-streams needed to mimic Storm's data distribution patterns)
>   - fault-tolerance (anchoring/acking) can be done at tuple granularity 
> (not at batch granularity, which leads to many more replayed tuples -- and 
> result duplicates -- in case of failure)
> The aim is to extend the TopologyBuilder interface with an additional 
> parameter 'batch_size' to expose this feature to the user. By default, 
> batching will be disabled.
> This batching feature serves a pure tuple-transport purpose, i.e., 
> tuple-by-tuple processing semantics are preserved. An output batch is 
> assembled at the producer and completely disassembled at the consumer. The 
> consumer's output can be batched again, independently of whether its input 
> was batched or not. Thus, batches can be of different sizes for each 
> producer-consumer pair. Furthermore, consumers can receive batches of 
> different sizes from different producers (including regular, non-batched 
> input).
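
One way the 'batch_size' parameter proposed above might surface at the API 
level (a sketch only; the trailing batch-size argument and the 
MySentenceSpout/MySplitBolt classes are hypothetical placeholders, while 
TopologyBuilder, setSpout, setBolt, and shuffleGrouping are existing Storm 
API):

    // Hypothetical sketch of the STORM-855 proposal -- the 'batch_size'
    // argument does not exist in Storm today.
    import backtype.storm.topology.TopologyBuilder;  // org.apache.storm.* in Storm 1.0+

    public class BatchingSketch {
        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();

            // Existing API: id, component, parallelism hint.
            builder.setSpout("sentences", new MySentenceSpout(), 2);

            // Proposed extension (hypothetical signature): a trailing
            // 'batch_size' argument; omitting it (or setting it to 1) would
            // keep batching disabled, preserving today's tuple-by-tuple
            // behavior for this producer-consumer pair.
            builder.setBolt("split", new MySplitBolt(), 4, /* batch_size */ 100)
                   .shuffleGrouping("sentences");
        }
    }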



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
