There is overhead in grabbing local data and moving it in and out of the system, especially if you are running a map-reduce job (like wc), which ends up mapping, sorting, copying, reducing, and writing again.
One way I've found to get around the overhead is to use Hadoop streaming and perform map-only tasks. While they recommend doing it properly with

    hstream -mapper /bin/cat -reducer /bin/wc

I tried:

    hstream -input "myinputfile" -output "myoutput" -mapper /bin/wc -numReduceTasks 0

(hstream is just an alias to do Hadoop streaming.)

And saw an immediate speedup on a 1 Gig and a 10 Gig file. In the end you may have several output files with the word count for each file, but adding those files together is pretty quick and easy (a sketch of the whole sequence follows below the quoted thread).

My recommendation is to explore how you can get away with either identity reducers, identity maps, or no reducers at all.

Theo

On Tue, Mar 11, 2008 at 4:21 PM, Jason Rennie <[EMAIL PROTECTED]> wrote:
> On Tue, Mar 11, 2008 at 5:18 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> > Yes. Each task is launching a JVM.
>
> Guess that would explain the slowness :)  Is HDFS tuned similarly?  We're
> thinking of possibly distributing our data using HDFS but storing a
> sufficiently small amount of data per node so that the linux kernel could
> buffer it all into memory.  Is there much overhead in grabbing data from
> HDFS if that data is stored locally?
>
> > Map reduce is not generally useful for real-time applications.  It is VERY
> > useful for large scale data reductions done in advance of real-time
> > operations.
> >
> > The basic issue is that the major performance contribution of map-reduce
> > architectures is large scale sequential access of data stores.  That is
> > pretty much in contradiction with real-time response.
>
> Gotcha.  We'll consider switching to a batch-style approach, which it sounds
> like Hadoop would be perfect for.
>
> Thanks,
>
> Jason

--
Theodore Van Rooy
http://greentheo.scroggles.com
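A rough sketch of the sequence described above, assuming hstream is an alias for the standard Hadoop streaming jar invocation; the alias definition, the jar path, and the final summing step are illustrative, not taken verbatim from the message:

    # Hypothetical alias; the actual streaming jar path depends on the install.
    alias hstream='hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar'

    # The "proper" way: cat as an identity mapper, wc as the reducer.
    hstream -input myinputfile -output myoutput_reduced \
            -mapper /bin/cat -reducer /bin/wc

    # The map-only shortcut: wc runs inside each map task, no reduce phase at all.
    hstream -input myinputfile -output myoutput \
            -mapper /bin/wc -numReduceTasks 0

    # Each map task leaves one part-* file; sum the per-split counts afterwards.
    hadoop fs -cat myoutput/part-* | \
        awk '{lines+=$1; words+=$2; chars+=$3} END {print lines, words, chars}'

The map-only run skips the sort/shuffle and reduce stages entirely, which is where the speedup comes from; the only extra work is the cheap final sum over a handful of small part files.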
