Andrew Olson created CRUNCH-680:
-----------------------------------

             Summary: Kafka Source should split very large partitions
                 Key: CRUNCH-680
                 URL: https://issues.apache.org/jira/browse/CRUNCH-680
             Project: Crunch
          Issue Type: Improvement
          Components: IO
            Reporter: Andrew Olson


If a single Kafka partition has a very large number of messages, the map task
for that partition can take a long time to run, leading to potential timeout
problems. We should limit the number of messages assigned to each split so that
the workload is more evenly balanced.

Based on our testing, 5 million messages is a reasonable general default for
the maximum split size. A configuration property
(org.apache.crunch.kafka.split.max) should be provided to optionally override
that value.
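
To make the proposal concrete, here is a minimal sketch of how one partition's
offset range could be cut into splits of bounded size. It is not the actual
Crunch change: the class name OffsetRangeSplitter, the method split, and the
constant DEFAULT_MAX_SPLIT_SIZE are hypothetical names chosen for illustration,
and the sketch assumes that each offset in the range counts as one message.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Crunch: splits one partition's offsets.
public class OffsetRangeSplitter {

    // Default suggested in the issue text (5 million messages per split).
    public static final long DEFAULT_MAX_SPLIT_SIZE = 5_000_000L;

    /**
     * Splits the half-open offset range [start, end) into consecutive
     * sub-ranges, each covering at most maxSize offsets.
     * Each returned array is {startInclusive, endExclusive}.
     */
    public static List<long[]> split(long start, long end, long maxSize) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        long cursor = start;
        while (cursor < end) {
            // Take the smaller of the cap and what is left, so the last
            // split is short rather than overrunning the end offset.
            long size = Math.min(maxSize, end - cursor);
            ranges.add(new long[] {cursor, cursor + size});
            cursor += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // A partition holding 12 million messages yields three splits.
        for (long[] r : split(0L, 12_000_000L, DEFAULT_MAX_SPLIT_SIZE)) {
            System.out.println(r[0] + " -> " + r[1]);
        }
        // prints:
        // 0 -> 5000000
        // 5000000 -> 10000000
        // 10000000 -> 12000000
    }
}
```

In a real job the cap would presumably be read from the job configuration
under org.apache.crunch.kafka.split.max, falling back to the default when the
property is not set; that lookup is left out here to keep the sketch free of
Hadoop dependencies.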



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
