OK, so I'm experimenting with the slow-running Hive query I was having trouble with earlier. It was indeed only processing one map task at a time, even though I *think* I told it to do more. Anyone who is good with Hadoop, feel free to speak up here as well; this is my first foray into setting up jobs for production. Here is the relevant configuration used on the job tracker and task tracker machines.

  <property>
    <name>mapred.map.tasks</name>
    <value>7</value>
    <description>The default number of map tasks per job. Typically set
    to a prime several times greater than number of available hosts.
    Ignored when mapred.job.tracker is "local".
    </description>
  </property>

  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>20</value>
    <description>The default number of parallel transfers run by reduce
    during the copy(shuffle) phase.
    </description>
  </property>

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>5</value>
    <description>The maximum number of map tasks that will be run
    simultaneously by a task tracker.
    </description>
  </property>

  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>5</value>
    <description>The maximum number of reduce tasks that will be run
    simultaneously by a task tracker.
    </description>
  </property>

The query was SELECT COUNT(DISTINCT(table.field)) FROM table;

Does anyone know why this might only be running one map task at a time? It takes about 5 minutes to go through 344 of them at this rate.
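In case it helps with diagnosing this, here is a rough sketch of how the values could be checked from inside the Hive session. It assumes the Hive CLI's `set <key>;` form, which prints the current value of a key. I haven't confirmed that this explains the behaviour. As far as I understand, the tasktracker maximums are read by the task tracker daemons themselves, so what the session prints for those may not match what the daemons are actually using.

```sql
-- Print the values the Hive session currently sees for the keys configured above.
-- Assumption: the session picks up the same config files as the cluster machines.
set mapred.map.tasks;
set mapred.tasktracker.map.tasks.maximum;
set mapred.tasktracker.reduce.tasks.maximum;
```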

Josh Ferguson
