Can you say a bit about where your bottleneck is? Is there one reduce task that's taking a very long time? If so, can you check the logs and see which datatype that reducer is dealing with? There was some discussion of this on JIRA recently; the consensus is that our current partitioner works well if you have a wide variety of datatypes, none of which is too big, and badly if you have one or two datatypes with lots of data in each.
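To make the skew concrete, here is a rough sketch in Python. It is not the actual Chukwa partitioner code; it assumes a simplified scheme in which every record of a given datatype is sent to the same reducer, chosen by hashing the datatype name. The function names, datatype names, and record counts are all made up for illustration.

```python
import zlib


def reducer_loads(records_per_datatype, num_reducers):
    """Return how many records each reducer receives.

    Assumption (hypothetical scheme, not Chukwa's real code): all records
    of one datatype go to a single reducer, picked by hashing the name.
    """
    loads = [0] * num_reducers
    for datatype, count in records_per_datatype.items():
        # crc32 is stable across runs, unlike Python's salted str hash().
        reducer = zlib.crc32(datatype.encode("utf-8")) % num_reducers
        loads[reducer] += count
    return loads


def skew(loads):
    """Busiest reducer's load divided by the mean load (1.0 means even)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0


# Made-up workloads with the same total volume (40,000 records).
many_small = {"Type%d" % i: 1000 for i in range(40)}
one_big = {"BigType": 38000, "SmallA": 1000, "SmallB": 1000}

if __name__ == "__main__":
    for name, workload in (("many small", many_small), ("one big", one_big)):
        loads = reducer_loads(workload, num_reducers=8)
        print(name, loads, "skew = %.1f" % skew(loads))
```

In the second workload one reducer necessarily gets at least 38,000 of the 40,000 records, so the job finishes only when that reducer does, however many reducers are configured.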
On Mon, May 10, 2010 at 3:07 PM, Corbin Hoenes <cor...@tynt.com> wrote:
> Is it possible to tune the time or size interval on demux to lower the amount
> of time it takes to get demuxed data into the hadoop cluster?
> (Or some other way?) Currently there is about a 20-30 minute lag on our
> setup. Wondering also if this is a wise thing to even try--maybe some side
> effects?

--
Ari Rabkin asrab...@gmail.com
UC Berkeley Computer Science Department