J-D,
I found what is causing the same rows to be sent to multiple map
tasks.
If another table has a column family with the same name, the Test job
will send the same rows to multiple map tasks.
I'm attaching the DEBUG logs and the test class.
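A quick way to cross-check this is to compare the table's region count
against the number of launched map tasks. The snippet below is only a
sketch: it assumes HTable.getStartKeys() and these package locations
exist in the 0.2 build, so adjust as needed.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

// Sketch (API names assumed, verify against your 0.2 build): prints
// how many regions HBase reports for the 'test' table. If this prints
// 1 while the job launches 2+ map tasks, the extra splits must be
// coming from outside the 'test' table.
public class RegionCount {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "test");
    System.out.println("regions for 'test' = " + table.getStartKeys().length);
  }
}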
thanks,
Dru
------------------------------------------------------------------------
Steps to reproduce:
1. Run "hadoop dfs -rmr /hbase" and start hbase with no tables.
2. Launch the hbase shell and create a table:
create 'test', 'content'
put 'test','test','content:test','testing'
put 'test','test2','content:test','testing2'
3. Run the Test MapReduce job. Only one map task is launched. This is
correct.
Jensens-MacBook-2:hadoop drujensen$ ./test.sh Test
08/08/01 10:23:07 INFO mapred.JobClient: Running job: job_200808010944_0006
08/08/01 10:23:08 INFO mapred.JobClient: map 0% reduce 0%
08/08/01 10:23:12 INFO mapred.JobClient: map 100% reduce 0%
08/08/01 10:23:14 INFO mapred.JobClient: map 100% reduce 100%
08/08/01 10:23:15 INFO mapred.JobClient: Job complete: job_200808010944_0006
08/08/01 10:23:15 INFO mapred.JobClient: Counters: 13
08/08/01 10:23:15 INFO mapred.JobClient: Job Counters
08/08/01 10:23:15 INFO mapred.JobClient: Launched map tasks=1
08/08/01 10:23:15 INFO mapred.JobClient: Launched reduce tasks=1
08/08/01 10:23:15 INFO mapred.JobClient: Map-Reduce Framework
08/08/01 10:23:15 INFO mapred.JobClient: Map input records=2
08/08/01 10:23:15 INFO mapred.JobClient: Map output records=0
08/08/01 10:23:15 INFO mapred.JobClient: Map input bytes=0
08/08/01 10:23:15 INFO mapred.JobClient: Map output bytes=0
08/08/01 10:23:15 INFO mapred.JobClient: Combine input records=0
08/08/01 10:23:15 INFO mapred.JobClient: Combine output records=0
08/08/01 10:23:15 INFO mapred.JobClient: Reduce input groups=0
08/08/01 10:23:15 INFO mapred.JobClient: Reduce input records=0
08/08/01 10:23:15 INFO mapred.JobClient: Reduce output records=0
08/08/01 10:23:15 INFO mapred.JobClient: File Systems
08/08/01 10:23:15 INFO mapred.JobClient: Local bytes read=136
08/08/01 10:23:15 INFO mapred.JobClient: Local bytes written=280
4. Create another table with a 'content' column family:
create 'weird', 'content'
5. Run the same test. Two map tasks are launched, even though the
column family is in a different table.
Jensens-MacBook-2:hadoop drujensen$ ./test.sh Test
08/08/01 10:24:15 INFO mapred.JobClient: Running job: job_200808010944_0007
08/08/01 10:24:16 INFO mapred.JobClient: map 0% reduce 0%
08/08/01 10:24:19 INFO mapred.JobClient: map 50% reduce 0%
08/08/01 10:24:21 INFO mapred.JobClient: map 100% reduce 0%
08/08/01 10:24:26 INFO mapred.JobClient: map 100% reduce 16%
08/08/01 10:24:27 INFO mapred.JobClient: map 100% reduce 100%
08/08/01 10:24:28 INFO mapred.JobClient: Job complete: job_200808010944_0007
08/08/01 10:24:28 INFO mapred.JobClient: Counters: 13
08/08/01 10:24:28 INFO mapred.JobClient: Job Counters
08/08/01 10:24:28 INFO mapred.JobClient: Launched map tasks=2
08/08/01 10:24:28 INFO mapred.JobClient: Launched reduce tasks=1
08/08/01 10:24:28 INFO mapred.JobClient: Map-Reduce Framework
08/08/01 10:24:28 INFO mapred.JobClient: Map input records=4
08/08/01 10:24:28 INFO mapred.JobClient: Map output records=0
08/08/01 10:24:28 INFO mapred.JobClient: Map input bytes=0
08/08/01 10:24:28 INFO mapred.JobClient: Map output bytes=0
08/08/01 10:24:28 INFO mapred.JobClient: Combine input records=0
08/08/01 10:24:28 INFO mapred.JobClient: Combine output records=0
08/08/01 10:24:28 INFO mapred.JobClient: Reduce input groups=0
08/08/01 10:24:28 INFO mapred.JobClient: Reduce input records=0
08/08/01 10:24:28 INFO mapred.JobClient: Reduce output records=0
08/08/01 10:24:28 INFO mapred.JobClient: File Systems
08/08/01 10:24:28 INFO mapred.JobClient: Local bytes read=136
08/08/01 10:24:28 INFO mapred.JobClient: Local bytes written=424
6. Create another table that doesn't have a 'content' column family:
create 'notweird', 'notcontent'
7. Run the test. Still only two map tasks are launched.
Jensens-MacBook-2:hadoop drujensen$ ./test.sh Test
08/08/01 10:24:57 INFO mapred.JobClient: Running job: job_200808010944_0008
08/08/01 10:24:58 INFO mapred.JobClient: map 0% reduce 0%
08/08/01 10:25:02 INFO mapred.JobClient: map 50% reduce 0%
08/08/01 10:25:03 INFO mapred.JobClient: map 100% reduce 0%
08/08/01 10:25:09 INFO mapred.JobClient: map 100% reduce 100%
08/08/01 10:25:10 INFO mapred.JobClient: Job complete: job_200808010944_0008
08/08/01 10:25:10 INFO mapred.JobClient: Counters: 13
08/08/01 10:25:10 INFO mapred.JobClient: Job Counters
08/08/01 10:25:10 INFO mapred.JobClient: Launched map tasks=2
08/08/01 10:25:10 INFO mapred.JobClient: Launched reduce tasks=1
08/08/01 10:25:10 INFO mapred.JobClient: Map-Reduce Framework
08/08/01 10:25:10 INFO mapred.JobClient: Map input records=4
08/08/01 10:25:10 INFO mapred.JobClient: Map output records=0
08/08/01 10:25:10 INFO mapred.JobClient: Map input bytes=0
08/08/01 10:25:10 INFO mapred.JobClient: Map output bytes=0
08/08/01 10:25:10 INFO mapred.JobClient: Combine input records=0
08/08/01 10:25:10 INFO mapred.JobClient: Combine output records=0
08/08/01 10:25:10 INFO mapred.JobClient: Reduce input groups=0
08/08/01 10:25:10 INFO mapred.JobClient: Reduce input records=0
08/08/01 10:25:10 INFO mapred.JobClient: Reduce output records=0
08/08/01 10:25:10 INFO mapred.JobClient: File Systems
08/08/01 10:25:10 INFO mapred.JobClient: Local bytes read=136
08/08/01 10:25:10 INFO mapred.JobClient: Local bytes written=424
8. Create another table with two column families, one called 'content':
create 'weirdest', 'content','notcontent'
9. Run the test. Three map tasks are launched.
Jensens-MacBook-2:hadoop drujensen$ ./test.sh Test
08/08/01 10:25:41 INFO mapred.JobClient: Running job: job_200808010944_0009
08/08/01 10:25:42 INFO mapred.JobClient: map 0% reduce 0%
08/08/01 10:25:45 INFO mapred.JobClient: map 33% reduce 0%
08/08/01 10:25:46 INFO mapred.JobClient: map 100% reduce 0%
08/08/01 10:25:50 INFO mapred.JobClient: map 100% reduce 100%
08/08/01 10:25:51 INFO mapred.JobClient: Job complete: job_200808010944_0009
08/08/01 10:25:51 INFO mapred.JobClient: Counters: 13
08/08/01 10:25:51 INFO mapred.JobClient: Job Counters
08/08/01 10:25:51 INFO mapred.JobClient: Launched map tasks=3
08/08/01 10:25:51 INFO mapred.JobClient: Launched reduce tasks=1
08/08/01 10:25:51 INFO mapred.JobClient: Map-Reduce Framework
08/08/01 10:25:51 INFO mapred.JobClient: Map input records=6
08/08/01 10:25:51 INFO mapred.JobClient: Map output records=0
08/08/01 10:25:51 INFO mapred.JobClient: Map input bytes=0
08/08/01 10:25:51 INFO mapred.JobClient: Map output bytes=0
08/08/01 10:25:51 INFO mapred.JobClient: Combine input records=0
08/08/01 10:25:51 INFO mapred.JobClient: Combine output records=0
08/08/01 10:25:51 INFO mapred.JobClient: Reduce input groups=0
08/08/01 10:25:51 INFO mapred.JobClient: Reduce input records=0
08/08/01 10:25:51 INFO mapred.JobClient: Reduce output records=0
08/08/01 10:25:51 INFO mapred.JobClient: File Systems
08/08/01 10:25:51 INFO mapred.JobClient: Local bytes read=136
08/08/01 10:25:51 INFO mapred.JobClient: Local bytes written=568
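To summarize the pattern above: the map task count tracks how many
tables contain a 'content' family (1, then 2, then 2, then 3), not how
many regions the 'test' table has. The self-contained toy below (all
class and method names are mine, not HBase's) reproduces that
arithmetic by filtering regions on family name instead of table name:

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only, not HBase source: counts "splits" two ways over the
// four single-region tables created in the steps above.
public class SplitFilterDemo {
  public static void main(String[] args) {
    Map<String, List<String>> meta = new LinkedHashMap<String, List<String>>();
    meta.put("test", Arrays.asList("content"));
    meta.put("weird", Arrays.asList("content"));
    meta.put("notweird", Arrays.asList("notcontent"));
    meta.put("weirdest", Arrays.asList("content", "notcontent"));

    int byFamily = 0, byTable = 0;
    for (Map.Entry<String, List<String>> e : meta.entrySet()) {
      if (e.getValue().contains("content")) byFamily++; // suspected buggy filter
      if (e.getKey().equals("test")) byTable++;         // expected filter
    }
    System.out.println("splits by family match = " + byFamily); // prints 3
    System.out.println("splits by table match  = " + byTable);  // prints 1
  }
}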
On Jul 31, 2008, at 5:42 PM, Jean-Daniel Cryans wrote:
Dru,
There is something truly weird with your setup. I would advise
running your code (the simple one that only logs the rows) with DEBUG
on. See the FAQ <http://wiki.apache.org/hadoop/Hbase/FAQ#5> on how to
do it. Then get back with syslog and stdout. This way we will have
more information on how the scanners are handling this.
Also FYI, I ran the same code as yours with 0.2.0 on my setup and had
no problems.
J-D
On Thu, Jul 31, 2008 at 7:06 PM, Dru Jensen <[EMAIL PROTECTED]> wrote:
UPDATE: I modified the RowCounter example and verified that it is
also sending the same rows to multiple map tasks. Is this a known bug,
or am I doing something truly as(s)inine? Any help is appreciated.
On Jul 30, 2008, at 3:02 PM, Dru Jensen wrote:
J-D,
Again, thank you for your help on this.
Hitting the HBase Master web UI on port 60010:
System 1 - 2 regions
System 2 - 1 region
System 3 - 3 regions
In order to demonstrate the behavior I'm seeing, I wrote a test
class.
// Imports assume the HBase 0.2 / Hadoop 0.17 package layout.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.HStoreKey;
import org.apache.hadoop.hbase.io.HbaseMapWritable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.mapred.TableMap;
import org.apache.hadoop.hbase.mapred.TableReduce;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Test extends Configured implements Tool {

  // Logs every row key this map task receives; emits nothing.
  public static class Map extends TableMap {
    @Override
    public void map(ImmutableBytesWritable key, RowResult row,
        OutputCollector output, Reporter r) throws IOException {
      String key_str = new String(key.get());
      System.out.println("map: key = " + key_str);
    }
  }

  // Must be static so the framework can instantiate it by reflection.
  public static class Reduce extends TableReduce {
    @Override
    public void reduce(WritableComparable key, Iterator values,
        OutputCollector output, Reporter r) throws IOException {
    }
  }

  public int run(String[] args) throws Exception {
    JobConf job = new JobConf(getConf(), Test.class);
    job.setJobName("Test");
    job.setNumMapTasks(4);
    job.setNumReduceTasks(1);
    // Scan only the 'test' table, 'content:' family.
    Map.initJob("test", "content:", Map.class, HStoreKey.class,
        HbaseMapWritable.class, job);
    Reduce.initJob("test", Reduce.class, job);
    JobClient.runJob(job);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new Test(), args);
    System.exit(res);
  }
}
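For anyone trying to reproduce this, the splits the job will actually
receive can be printed before submission using the stock Hadoop 0.17
JobConf/InputFormat API. A sketch (the DumpSplits helper name is mine;
call DumpSplits.dump(job) in run() right before JobClient.runJob(job)):

import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;

// Sketch: prints the splits the configured input format would hand to
// the JobTracker, so an extra map task can be traced to an extra split.
public class DumpSplits {
  public static void dump(JobConf job) throws Exception {
    InputFormat fmt = job.getInputFormat();
    InputSplit[] splits = fmt.getSplits(job, job.getNumMapTasks());
    System.out.println("number of splits = " + splits.length);
    for (int i = 0; i < splits.length; i++) {
      System.out.println("split[" + i + "] = " + splits[i]);
    }
  }
}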
In hbase shell:
create 'test','content'
put 'test','test','content:test','testing'
put 'test','test2','content:test','testing2'
The Hadoop log results:
Task Logs: 'task_200807301447_0001_m_000000_0'
stdout logs
map: key = test
map: key = test2
stderr logs
syslog logs
2008-07-30 14:51:16,410 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2008-07-30 14:51:16,507 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
2008-07-30 14:51:17,120 INFO org.apache.hadoop.mapred.TaskRunner: Task 'task_200807301447_0001_m_000000_0' done.
Task Logs: 'task_200807301447_0001_m_000001_0'
stdout logs
map: key = test
map: key = test2
stderr logs
syslog logs
2008-07-30 14:51:16,410 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2008-07-30 14:51:16,509 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
2008-07-30 14:51:17,118 INFO org.apache.hadoop.mapred.TaskRunner: Task 'task_200807301447_0001_m_000001_0' done.
Tasks 3 and 4 are the same.
Each map task is seeing the same rows. Any help preventing this is
appreciated.
Thanks,
Dru
On Jul 30, 2008, at 2:22 PM, Jean-Daniel Cryans wrote:
Dru,
It is not supposed to process the same rows multiple times. Can I see
the log you're talking about? Also, how many regions do you have in
your table? (Info available in the web UI.)
thx
J-D
On Wed, Jul 30, 2008 at 5:04 PM, Dru Jensen <[EMAIL PROTECTED]> wrote:
J-D,
thanks for your quick response. I have 4 mapping processes running on
3 systems.
Are the same rows being processed 4 times, once by each mapping
process? According to the logs, they are.
When I run a map/reduce against a file, only one row gets logged per
mapper. Why would this be different for hbase tables?
I would think only one mapping process would handle a given row, and
it would show up in only one log; preferably on the same system that
hosts the region.
I only want each row to be processed once. Is there any way to change
this behavior without running only 1 mapper?
thanks,
Dru
On Jul 30, 2008, at 1:44 PM, Jean-Daniel Cryans wrote:
Dru,
The regions will split when they reach a certain size threshold, so
if you want your computation to be distributed, you will have to have
more data.
Regards,
J-D
On Wed, Jul 30, 2008 at 4:36 PM, Dru Jensen <[EMAIL PROTECTED]> wrote:
Hello,
I created a map/reduce process by extending the TableMap and
TableReduce API, but for some reason, when I run multiple mappers, the
logs show that the same rows are being processed by each Mapper.
By logs, I mean the hadoop task tracker (localhost:50030), drilling
down into the individual task logs.
Do I need to manually perform a TableSplit, or is this supposed to be
done automatically?
If it's something I need to do manually, can someone point me to some
sample code?
If it's supposed to be automatic and each mapper is supposed to get
its own set of rows, should I write up a bug for this? I'm using trunk
0.2.0 on Hadoop trunk 0.17.2.
thanks,
Dru