This helps, thanks. I decreased hbase.hregion.max.filesize to 67M and increased my table to around 500,000 rows, so I finally get several tasks.
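For concreteness, the change described above would look roughly like this in hbase-site.xml (67108864 bytes = 64 MB is my reading of "67M" -- the exact value used in the thread is not shown -- and the description text is mine, not from the HBase defaults file):

```xml
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>67108864</value>
  <description>Maximum store file size before a region is split.
  67108864 bytes = 64 MB, down from the 268435456 (256 MB) default,
  and equal to the usual DFS block size.</description>
</property>
```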
However, they don't seem to run in parallel (see the log below). Am I doing anything wrong, or is that the way it's supposed to be?

Thanks
-Yair

08/07/08 22:27:37 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
08/07/08 22:27:38 INFO mapred.JobClient: Running job: job_local_1
08/07/08 22:27:38 INFO mapred.MapTask: numReduceTasks: 1
08/07/08 22:27:38 INFO hbase.HTable: Creating scanner over ase starting at key
08/07/08 22:27:39 INFO mapred.JobClient:  map 0% reduce 0%
08/07/08 22:27:44 INFO mapred.LocalJobRunner:
08/07/08 22:27:47 INFO mapred.LocalJobRunner:
08/07/08 22:27:50 INFO mapred.LocalJobRunner:
08/07/08 22:27:53 INFO mapred.LocalJobRunner:
08/07/08 22:27:56 INFO mapred.LocalJobRunner:
08/07/08 22:27:59 INFO mapred.LocalJobRunner:
08/07/08 22:28:02 INFO mapred.LocalJobRunner:
08/07/08 22:28:05 INFO mapred.LocalJobRunner:
08/07/08 22:28:08 INFO mapred.LocalJobRunner:
08/07/08 22:28:11 INFO mapred.LocalJobRunner:
08/07/08 22:28:14 INFO mapred.LocalJobRunner:
08/07/08 22:28:17 INFO mapred.LocalJobRunner:
08/07/08 22:28:20 INFO mapred.LocalJobRunner:
08/07/08 22:28:20 INFO mapred.TaskRunner: Task 'job_local_1_map_0000' done.
08/07/08 22:28:20 INFO mapred.TaskRunner: Saved output of task 'job_local_1_map_0000' to file:/home/hadoop/TestHbase/results
08/07/08 22:28:20 INFO mapred.MapTask: numReduceTasks: 1
08/07/08 22:28:20 INFO mapred.JobClient:  map 100% reduce 0%
08/07/08 22:28:20 INFO hbase.HTable: Creating scanner over ase starting at key Jn3ae1DL-goAAABprCQA
08/07/08 22:28:20 INFO mapred.LocalJobRunner:
08/07/08 22:28:21 INFO mapred.JobClient:  map 0% reduce 0%
08/07/08 22:28:26 INFO mapred.LocalJobRunner:
08/07/08 22:28:26 INFO mapred.JobClient:  map 50% reduce 0%
08/07/08 22:28:29 INFO mapred.LocalJobRunner:
08/07/08 22:28:32 INFO mapred.LocalJobRunner:
08/07/08 22:28:35 INFO mapred.LocalJobRunner:
08/07/08 22:28:38 INFO mapred.LocalJobRunner:
08/07/08 22:28:41 INFO mapred.LocalJobRunner:
08/07/08 22:28:44 INFO mapred.LocalJobRunner:
08/07/08 22:28:47 INFO mapred.LocalJobRunner:
08/07/08 22:28:50 INFO mapred.LocalJobRunner:
08/07/08 22:28:53 INFO mapred.LocalJobRunner:
08/07/08 22:28:56 INFO mapred.LocalJobRunner:
08/07/08 22:28:59 INFO mapred.LocalJobRunner:
08/07/08 22:29:01 INFO mapred.LocalJobRunner:
08/07/08 22:29:01 INFO mapred.TaskRunner: Task 'job_local_1_map_0001' done.
08/07/08 22:29:01 INFO mapred.TaskRunner: Saved output of task 'job_local_1_map_0001' to file:/home/hadoop/TestHbase/results
08/07/08 22:29:01 INFO mapred.MapTask: numReduceTasks: 1
08/07/08 22:29:01 INFO hbase.HTable: Creating scanner over ase starting at key avBRwzXL-goAAAAVKkMA
08/07/08 22:29:01 INFO mapred.JobClient:  map 100% reduce 0%
08/07/08 22:29:02 INFO mapred.LocalJobRunner:
08/07/08 22:29:02 INFO mapred.JobClient:  map 33% reduce 0%
08/07/08 22:29:07 INFO mapred.LocalJobRunner:
08/07/08 22:29:07 INFO mapred.JobClient:  map 66% reduce 0%
08/07/08 22:29:10 INFO mapred.LocalJobRunner:
08/07/08 22:29:13 INFO mapred.LocalJobRunner:
08/07/08 22:29:16 INFO mapred.LocalJobRunner:
08/07/08 22:29:19 INFO mapred.LocalJobRunner:
08/07/08 22:29:22 INFO mapred.LocalJobRunner:
08/07/08 22:29:25 INFO mapred.LocalJobRunner:
08/07/08 22:29:28 INFO mapred.LocalJobRunner:
08/07/08 22:29:31 INFO mapred.LocalJobRunner:
08/07/08 22:29:34 INFO mapred.LocalJobRunner:
08/07/08 22:29:37 INFO mapred.LocalJobRunner:
08/07/08 22:29:39 INFO mapred.LocalJobRunner:
08/07/08 22:29:39 INFO mapred.TaskRunner: Task 'job_local_1_map_0002' done.
08/07/08 22:29:39 INFO mapred.TaskRunner: Saved output of task 'job_local_1_map_0002' to file:/home/hadoop/TestHbase/results
08/07/08 22:29:39 INFO mapred.MapTask: numReduceTasks: 1
08/07/08 22:29:39 INFO hbase.HTable: Creating scanner over ase starting at key k90QjEfL-goAAAAwfmsA
08/07/08 22:29:40 INFO mapred.JobClient:  map 100% reduce 0%
08/07/08 22:29:40 INFO mapred.LocalJobRunner:
08/07/08 22:29:41 INFO mapred.JobClient:  map 50% reduce 0%
08/07/08 22:29:45 INFO mapred.LocalJobRunner:
08/07/08 22:29:46 INFO mapred.JobClient:  map 75% reduce 0%
08/07/08 22:29:48 INFO mapred.LocalJobRunner:
08/07/08 22:29:51 INFO mapred.LocalJobRunner:
08/07/08 22:29:54 INFO mapred.LocalJobRunner:
08/07/08 22:29:57 INFO mapred.LocalJobRunner:
08/07/08 22:30:00 INFO mapred.LocalJobRunner:
08/07/08 22:30:03 INFO mapred.LocalJobRunner:
08/07/08 22:30:06 INFO mapred.LocalJobRunner:
08/07/08 22:30:09 INFO mapred.LocalJobRunner:
08/07/08 22:30:12 INFO mapred.LocalJobRunner:
08/07/08 22:30:15 INFO mapred.LocalJobRunner:
08/07/08 22:30:18 INFO mapred.LocalJobRunner:
08/07/08 22:30:20 INFO mapred.LocalJobRunner:
08/07/08 22:30:20 INFO mapred.TaskRunner: Task 'job_local_1_map_0003' done.
08/07/08 22:30:20 INFO mapred.TaskRunner: Saved output of task 'job_local_1_map_0003' to file:/home/hadoop/TestHbase/results
08/07/08 22:30:21 INFO mapred.JobClient:  map 100% reduce 0%
08/07/08 22:30:21 INFO mapred.LocalJobRunner:
08/07/08 22:30:22 INFO mapred.JobClient:  map 75% reduce 0%
08/07/08 22:30:26 INFO mapred.LocalJobRunner: reduce > reduce
08/07/08 22:30:27 INFO mapred.JobClient:  map 75% reduce 75%
08/07/08 22:30:29 INFO mapred.LocalJobRunner: reduce > reduce
08/07/08 22:30:30 INFO mapred.JobClient:  map 75% reduce 80%
08/07/08 22:30:32 INFO mapred.LocalJobRunner: reduce > reduce
08/07/08 22:30:33 INFO mapred.JobClient:  map 75% reduce 85%
08/07/08 22:30:35 INFO mapred.LocalJobRunner: reduce > reduce
08/07/08 22:30:36 INFO mapred.JobClient:  map 75% reduce 90%
08/07/08 22:30:36 INFO mapred.LocalJobRunner: reduce > reduce
08/07/08 22:30:36 INFO mapred.TaskRunner: Task 'reduce_vz4c9o' done.
08/07/08 22:30:36 INFO mapred.TaskRunner: Saved output of task 'reduce_vz4c9o' to file:/home/hadoop/TestHbase/results
08/07/08 22:30:37 INFO mapred.JobClient: Job complete: job_local_1
08/07/08 22:30:37 INFO mapred.JobClient: Counters: 10
08/07/08 22:30:37 INFO mapred.JobClient:   Map-Reduce Framework
08/07/08 22:30:37 INFO mapred.JobClient:     Map input records=496372
08/07/08 22:30:37 INFO mapred.JobClient:     Map output records=496372
08/07/08 22:30:37 INFO mapred.JobClient:     Map input bytes=0
08/07/08 22:30:37 INFO mapred.JobClient:     Map output bytes=13953983
08/07/08 22:30:37 INFO mapred.JobClient:     Combine input records=0
08/07/08 22:30:37 INFO mapred.JobClient:     Combine output records=0
08/07/08 22:30:37 INFO mapred.JobClient:     Reduce input groups=496372
08/07/08 22:30:37 INFO mapred.JobClient:     Reduce input records=496372
08/07/08 22:30:37 INFO mapred.JobClient:     Reduce output records=496372
08/07/08 22:30:37 INFO mapred.JobClient:   com.revenuescience.sandbox.hbase.RowCounter$Counters
08/07/08 22:30:37 INFO mapred.JobClient:     ROWS=496372

-----Original Message-----
From: Andrew
Purtell [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 09, 2008 1:38 PM
To: Yair Even-Zohar
Cc: [email protected]
Subject: RE: Slow mapreduce using Hbase, regardless on number of machines

New HBase tables start with one region. The default split point -- when an existing region is split into more regions -- is when the size of the backing store file for any column family of the table exceeds 256 MB. Until the table splits, you are guaranteed that only one RegionServer will be serving the table. Furthermore, the TableMap utility class sets the number of map operations for a job equal to the number of regions in the table; taking I/O considerations into account, this makes sense.

One way to speed up the process of splitting a table into multiple regions is to adjust the hbase.hregion.max.filesize configuration parameter downward. I would advise not setting this value smaller than the DFS block size. Even so, until you store a substantial amount of data in your test table(s), there is little if any parallelism available, and furthermore you incur the overhead of Hadoop job scheduling.

Hope this helps,
- Andy

--- On Wed, 7/9/08, Yair Even-Zohar <[EMAIL PROTECTED]> wrote:

> From: Yair Even-Zohar <[EMAIL PROTECTED]>
> Subject: RE: Slow mapreduce using Hbase, regardless on number of machines
> To: [email protected]
> Date: Wednesday, July 9, 2008, 9:30 AM
>
> How do I find the number of regions for an HTable?
> In a quick lookup I did on the actual machines, it seems that all the
> machines had new data in them once I loaded the table.
>
> Thanks
> -Yair
>
> -----Original Message-----
> From: Bryan Duxbury [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, July 09, 2008 11:13 AM
> To: [email protected]
> Subject: Re: Slow mapreduce using Hbase, regardless on number of machines
>
> How many regions are there in your table?
> If your 200k rows fit inside a single region, adding more region
> servers isn't going to make anything faster, because only one server
> will be participating.
>
> -Bryan
>
> On Jul 9, 2008, at 7:36 AM, yair even-zohar wrote:
>
> > I am testing HBase 0.1.2 and am getting the following performance
> > using the RowCounter class (I had to modify the main() method of the
> > original class because it contains some hardcoded parameters :-)
> >
> > Single regionserver - counting 200,000 lines in 60 or 61 seconds
> > 5 regionservers - counting 200,000 lines in 55 or 58 seconds
> >
> > Clearly, one expects better performance, so I assume I'm doing
> > something wrong. By the way, I'm getting about the same performance
> > when I'm iterating through a scanner without the mapreduce.
> >
> > Here is my hadoop-site.xml:
> >
> > <configuration>
> >   <property>
> >     <name>fs.default.name</name>
> >     <value>hdfs://sb-centercluster01:9100</value>
> >   </property>
> >   <property>
> >     <name>mapred.job.tracker</name>
> >     <value>hdfs://sb-centercluster01:9101</value>
> >   </property>
> >   <property>
> >     <name>mapred.map.tasks</name>
> >     <value>13</value>
> >   </property>
> >   <property>
> >     <name>mapred.reduce.tasks</name>
> >     <value>5</value>
> >   </property>
> >   <property>
> >     <name>dfs.replication</name>
> >     <value>3</value>
> >   </property>
> >   <property>
> >     <name>dfs.name.dir</name>
> >     <value>/home/hadoop/dfs16,/tmp/hadoop/dfs16</value>
> >   </property>
> >   <property>
> >     <name>dfs.data.dir</name>
> >     <value>/state/partition1/hadoop/dfs16</value>
> >   </property>
> > </configuration>
> >
> > Increasing "io.bytes.per.checksum" and "io.file.buffer.size" didn't
> > help. Neither did decreasing "dfs.replication".
> >
> > Here is my hbase-site.xml:
> >
> > <configuration>
> >   <property>
> >     <name>hbase.master</name>
> >     <value>sb-centercluster01:60002</value>
> >     <description>The host and port that the HBase master runs at.
> >     </description>
> >   </property>
> >   <property>
> >     <name>hbase.rootdir</name>
> >     <value>hdfs://sb-centercluster01:9100/hbase</value>
> >     <description>The directory shared by region servers.
> >     </description>
> >   </property>
> >   <property>
> >     <name>hbase.io.index.interval</name>
> >     <value>8</value>
> >   </property>
> > </configuration>
> >
> > Any help will be appreciated.
> >
> > Thanks
> > -Yair
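To make Andy's point about task counts concrete, here is a rough pseudocode sketch of the behavior he describes (a paraphrase of the TableMap/TableInputFormat machinery, not the actual HBase source; names like TableSplit are illustrative):

```
// one input split -- and therefore one map task -- per table region
startKeys = table.getStartKeys()            // one entry per region
for each consecutive pair (start, end) in startKeys:
    splits.add(TableSplit(tableName, start, end))
job.numMapTasks = splits.size()             // region count wins
```

This is why the mapred.map.tasks=13 in the hadoop-site.xml above has no effect on a TableMap job: until the table has split into multiple regions, there is exactly one split and exactly one map task.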
