OK, so I'm experimenting with the slow-running Hive query I was having
trouble with earlier. It was indeed only processing one map task at a
time, even though I *think* I told it to do more. Anyone who is good with
Hadoop, feel free to speak up here as well; this is my first foray into
setting up jobs for production. Here is the relevant configuration used
on the job tracker and task tracker machines:
<property>
<name>mapred.map.tasks</name>
<value>7</value>
<description>The default number of map tasks per job. Typically set
to a prime several times greater than number of available hosts.
Ignored when mapred.job.tracker is "local".
</description>
</property>
<property>
<name>mapred.reduce.parallel.copies</name>
<value>20</value>
<description>The default number of parallel transfers run by reduce
during the copy(shuffle) phase.
</description>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>5</value>
<description>The maximum number of map tasks that will be run
simultaneously by a task tracker.
</description>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>5</value>
<description>The maximum number of reduce tasks that will be run
simultaneously by a task tracker.
</description>
</property>
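For context, here is a minimal sketch of how one of these entries sits in the site file. The file name (conf/hadoop-site.xml) is an assumption on my part and may differ by Hadoop version; the property name and value are the ones quoted above, and the only addition is the standard <configuration> root element.

```xml
<?xml version="1.0"?>
<!-- Sketch only: assumes the overrides live in conf/hadoop-site.xml
     on each task tracker machine. File name and placement are assumptions. -->
<configuration>
  <property>
    <!-- Per-task-tracker cap on concurrently running map tasks -->
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>5</value>
  </property>
</configuration>
```

As far as I understand it, the tasktracker maximums are read when the task tracker daemon starts, so a changed value would only apply after that daemon is restarted. I have not verified that this is what is happening here.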
The query was SELECT COUNT(DISTINCT(table.field)) FROM table;
Does anyone know why this might only be running one map task at a time?
It takes about 5 minutes to go through 344 of them at this rate.
Josh Ferguson