Chris Riccomini created SAMZA-220:
-------------------------------------

             Summary: SystemConsumers is slow when consuming from a large 
number of partitions
                 Key: SAMZA-220
                 URL: https://issues.apache.org/jira/browse/SAMZA-220
             Project: Samza
          Issue Type: Bug
          Components: container
    Affects Versions: 0.6.0
            Reporter: Chris Riccomini


We have observed poor throughput when a SamzaContainer is consuming many 
partitions (100s). The more partitions, the worse the performance gets.

When hooking up VisualVM, two operations take up more than 65% of the CPU in 
SystemConsumers:

{code}
    refresh.maybeCall()
    updateMessageChooser
{code}

The problem is that we run each of these operations once before every process() 
call to a StreamTask. Both of these operations iterate over *all* 
SystemStreamPartitions that the SystemConsumers is consuming from. If you have 
hundreds of partitions, it means you do two loops of 100+ items for every 
message you process. This is true even if the SystemConsumers buffer has a lot 
of messages (10,000+), and also true even if most systemStreamPartitions have 
no messages available.

I have two proposed solutions to this problem:

1. Only call refresh.maybeCall() when the total number of buffered messages in 
the SystemConsumers has dropped below some low watermark.
2. Only have updateMessageChooser call messageChooser.update for 
systemStreamPartitions that actually *have* a message.

I have implemented this and deployed it on a few jobs, and I am seeing 
significant performance improvement. From 10k-20k msgs/sec to 50k+.

The trade off, as I see it is really around (1), which will introduce a little 
latency for topics that are low volume. In such a case, the time from when a 
message arrives to when it gets refreshed in the buffer, and updated in the 
chooser increases.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to