UPDATE: I modified the RowCounter example and verified that it, too, is
sending the same rows to multiple map tasks. Is this a known bug,
or am I doing something truly as(s)inine? Any help is appreciated.
On Jul 30, 2008, at 3:02 PM, Dru Jensen wrote:
J-D,
Again, thank you for your help on this.
Hitting the HBase master web UI on port 60010 shows:
System 1 - 2 regions
System 2 - 1 region
System 3 - 3 regions
To demonstrate the behavior I'm seeing, I wrote a test class:
// (imports shown for completeness; package names per the hbase 0.2 / hadoop 0.17 layout)
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.HStoreKey;
import org.apache.hadoop.hbase.io.HbaseMapWritable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.mapred.TableMap;
import org.apache.hadoop.hbase.mapred.TableReduce;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Test extends Configured implements Tool {

  // Logs every row key this map task receives; emits nothing.
  public static class Map extends TableMap {
    @Override
    public void map(ImmutableBytesWritable key, RowResult row,
        OutputCollector output, Reporter r) throws IOException {
      String keyStr = new String(key.get());
      System.out.println("map: key = " + keyStr);
    }
  }

  // No-op reducer (static so the framework can instantiate it).
  public static class Reduce extends TableReduce {
    @Override
    public void reduce(WritableComparable key, Iterator values,
        OutputCollector output, Reporter r) throws IOException {
    }
  }

  public int run(String[] args) throws Exception {
    JobConf job = new JobConf(getConf(), Test.class);
    job.setJobName("Test");
    job.setNumMapTasks(4);      // a hint only; the input format decides the actual splits
    job.setNumReduceTasks(1);
    Map.initJob("test", "content:", Map.class, HStoreKey.class,
        HbaseMapWritable.class, job);
    Reduce.initJob("test", Reduce.class, job);
    JobClient.runJob(job);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new Test(), args);
    System.exit(res);
  }
}
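If it helps the diagnosis, the map() above can also print the split the task was handed next to each key (a sketch; it assumes Reporter.getInputSplit() is supported by this Hadoop version, otherwise it throws UnsupportedOperationException):

  // Variant of map() that also logs this task's input split.
  @Override
  public void map(ImmutableBytesWritable key, RowResult row,
      OutputCollector output, Reporter r) throws IOException {
    String keyStr = new String(key.get());
    try {
      System.out.println("map: split = " + r.getInputSplit() + ", key = " + keyStr);
    } catch (UnsupportedOperationException e) {
      // Some runners don't expose the split; fall back to logging the key only.
      System.out.println("map: key = " + keyStr);
    }
  }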
In hbase shell:
create 'test','content'
put 'test','test','content:test','testing'
put 'test','test2','content:test','testing2'
The Hadoop log results:
Task Logs: 'task_200807301447_0001_m_000000_0'
stdout logs
map: key = test
map: key = test2
stderr logs
syslog logs
2008-07-30 14:51:16,410 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2008-07-30 14:51:16,507 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
2008-07-30 14:51:17,120 INFO org.apache.hadoop.mapred.TaskRunner: Task 'task_200807301447_0001_m_000000_0' done.

Task Logs: 'task_200807301447_0001_m_000001_0'
stdout logs
map: key = test
map: key = test2
stderr logs
syslog logs
2008-07-30 14:51:16,410 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2008-07-30 14:51:16,509 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
2008-07-30 14:51:17,118 INFO org.apache.hadoop.mapred.TaskRunner: Task 'task_200807301447_0001_m_000001_0' done.
Tasks 3 and 4 show the same thing. Each map task is seeing the same rows. Any help
preventing this is appreciated.
Thanks,
Dru
On Jul 30, 2008, at 2:22 PM, Jean-Daniel Cryans wrote:
Dru,
It is not supposed to process the same rows multiple times. Can I see the log
you're talking about? Also, how many regions do you have in your table?
(The info is available in the web UI.)
thx
J-D
On Wed, Jul 30, 2008 at 5:04 PM, Dru Jensen <[EMAIL PROTECTED]>
wrote:
J-D,
thanks for your quick response. I have 4 mapping processes running on 3 systems.
Are the same rows being processed by all 4 mapping processes? According to the logs they are.
When I run a map/reduce against a file, only one row gets logged per mapper. Why would this be different for hbase tables?
I would think only one mapping process would handle a given row, so it would show up once, in only one log. Preferably it would be the same system that hosts the region.
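My mental model is that the input format hands each map task its own disjoint split, and a record reader that returns only the rows inside that split, i.e. roughly the standard mapred contract (a sketch of the interface as I understand it, not HBase's actual implementation):

  // Old-API (org.apache.hadoop.mapred) input format contract, roughly:
  // getSplits() carves the input into disjoint pieces (numSplits is only a
  // hint), and each map task runs over exactly one piece via its RecordReader.
  public interface InputFormat<K, V> {
    InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
    RecordReader<K, V> getRecordReader(InputSplit split, JobConf job,
        Reporter reporter) throws IOException;
  }

If the table input format builds one split per region, I would expect each row to show up in at most one task's log.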
I only want each row to be processed once. Is there any way to change this
behavior without running only 1 mapper?
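For now the only workaround I can think of is forcing everything through a single map task, which defeats the point of distributing the work:

  // Workaround I'd rather avoid: one mapper sees the whole table.
  job.setNumMapTasks(1);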
thanks,
Dru
On Jul 30, 2008, at 1:44 PM, Jean-Daniel Cryans wrote:
Dru,
Regions split when they reach a certain size threshold, so if you want your
computation to be distributed, you will have to load more data.
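If you just want to see splitting happen sooner while testing, you can also lower the region split threshold. This is a sketch only: the property name is from memory of the 0.2-era hbase-default.xml, so verify it in your release, and note that it is read by the region servers, so it really belongs in hbase-site.xml on the cluster rather than in client code:

  // Sketch only: property name from memory, verify against hbase-default.xml.
  // The region servers do the splitting, so set this in their hbase-site.xml;
  // it is shown in code here just to identify the knob.
  HBaseConfiguration conf = new HBaseConfiguration();
  conf.setLong("hbase.hregion.max.filesize", 64L * 1024 * 1024); // e.g. 64MB for testing

Otherwise, as you load more data the regions will split on their own and the map tasks will spread out.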
Regards,
J-D
On Wed, Jul 30, 2008 at 4:36 PM, Dru Jensen <[EMAIL PROTECTED]>
wrote:
Hello,
I created a map/reduce process by extending the TableMap and TableReduce API,
but for some reason when I run multiple mappers, the logs show that the same
rows are being processed by each Mapper.
When I say logs, I mean the hadoop task tracker (localhost:50030), drilling
down into each task's logs.
Do I need to manually perform a TableSplit, or is this supposed to be done
automatically?
If it's something I need to do manually, can someone point me to some sample
code?
If it's supposed to be automatic and each mapper is supposed to get its own
set of rows, should I write up a bug for this? I'm using hbase trunk 0.2.0 on
hadoop trunk 0.17.2.
thanks,
Dru