I am working with moonwatcher.
I think we found the problem: the dataset we were using has over 3000 files,
and for each file Hadoop has to start a separate task (a new JVM). After we
combined all the files into one, the job finished very quickly.
I think the task JVMs should be pooled and reused.
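
For anyone hitting the same problem, the fix was nothing fancy: just
concatenate the small inputs into one file before submitting the job. A rough
sketch in plain Java (the directory and file names are made up for
illustration):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Concatenates many small log files into one input file, so the job
// runs a handful of map tasks instead of launching one JVM per file.
public class CombineLogs {
    public static void main(String[] args) throws IOException {
        Path combined = Paths.get("combined.log");  // hypothetical output name
        try (OutputStream out = Files.newOutputStream(combined);
             DirectoryStream<Path> logs =
                     Files.newDirectoryStream(Paths.get("logs"), "*.log")) {
            for (Path log : logs) {
                // Append each file's bytes; assumes every log file
                // already ends with a newline.
                Files.copy(log, out);
            }
        }
    }
}

(For readers finding this thread later: JVM reuse along these lines did
eventually land in Hadoop, via the mapred.job.reuse.jvm.num.tasks setting;
nothing like it exists in 0.12.3.)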
ds
moonwatcher <[EMAIL PROTECTED]> wrote:
I am using hadoop-0.12.3 on a single-node cluster, running the HDFS daemons,
the job tracker, and the task tracker.
The dataset is about 12 MB of log files.
Other information:
During the map phase, CPU usage went very high, close to 100%.
During the reduce phase, CPU usage was near zero, usually 1% or 2%. The
reduce phase did complete eventually, though, and produced the correct
output. This behaviour is consistent.
thanks,
mw
Doug Cutting wrote:
What version of Hadoop are you using? On what sort of a cluster? How big is
your dataset?
Doug
moonwatcher wrote:
> hey guys,
>
> I've set up hadoop in distributed mode (jobtracker, tasktracker, and hdfs
> daemons), and I'm observing that the map phase executes very quickly but
> the reduce phase is very slow. The application simply reads some log files,
> whose lines consist of key-value pairs, and summarizes them by key, summing
> the values... so this seems like an ideal application of hadoop.
>
> Could you suggest where the bottleneck might be? By logging, I observed
> that it is not in my reducer implementation. Could it be in the RPC, or in
> the sort or copy phases?
> Are there any particular properties that should be tweaked?
>
> thanks and best regards,
> mw
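
For anyone landing on this thread later: the job moonwatcher describes is a
plain sum-by-key, and a minimal sketch follows. Note that it is written
against the modern org.apache.hadoop.mapreduce API (which did not exist in
0.12.3, where you would use the old mapred JobConf/OutputCollector API
instead), and it assumes a hypothetical key=value line format:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogSum {

    // Emits (key, value) for each "key=value" log line; malformed
    // lines without an '=' are silently skipped.
    public static class ParseMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] kv = line.toString().split("=", 2);  // assumed format
            if (kv.length == 2) {
                ctx.write(new Text(kv[0].trim()),
                          new LongWritable(Long.parseLong(kv[1].trim())));
            }
        }
    }

    // Sums all values seen for a key.
    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values,
                              Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            ctx.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log-sum");
        job.setJarByClass(LogSum.class);
        job.setMapperClass(ParseMapper.class);
        job.setCombinerClass(SumReducer.class);  // safe: summing is associative
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With only ~12 MB of input, a job like this should finish almost immediately,
which makes per-task startup overhead across thousands of small files, as
diagnosed at the top of the thread, a plausible culprit.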