Hey guys, I've been using the Hadoop consumer a whole lot this week, but I'm seeing pretty poor throughput with one task per partition. I figured a good solution would be to run multiple tasks per partition, so I wanted to run my assumptions by you all first:
This should enable the broker to round-robin events between tasks, right? When I record the high watermark at the end of the MapReduce job there will be N entries for each partition (one per task), so is it correct to just take max(watermarks)? My assumption is that since the tasks are getting events round-robin, everything should have been consumed up to the highest watermark found for that partition. Does this hold true? Is anyone else using the consumer like this?

--
Matthew Rathbone
Foursquare | Software Engineer | Server Engineering Team
matt...@foursquare.com | @rathboma <http://twitter.com/rathboma> | 4sq <http://foursquare.com/rathboma>
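
P.S. For concreteness, here's a rough sketch of the watermark merge I have in mind: take the N per-task entries written at the end of the job and collapse them into a single highest offset per partition. The Watermark class and field names below are just placeholders I made up for illustration, not the actual consumer classes.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rough sketch only: "Watermark" and its fields are placeholders,
// not the real hadoop consumer classes.
public class WatermarkMerge {

    public static class Watermark {
        final String topic;
        final int partition;
        final long offset;

        Watermark(String topic, int partition, long offset) {
            this.topic = topic;
            this.partition = partition;
            this.offset = offset;
        }
    }

    // Collapse the N per-task entries into one max offset per (topic, partition).
    public static Map<String, Long> merge(List<Watermark> entries) {
        Map<String, Long> highest = new HashMap<String, Long>();
        for (Watermark w : entries) {
            String key = w.topic + "-" + w.partition;
            Long current = highest.get(key);
            if (current == null || w.offset > current) {
                highest.put(key, w.offset);
            }
        }
        return highest;
    }
}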