Jean Daniel,

Now I am running HBase 0.19.1 candidate 2 and Hadoop 0.19.2 (the code was taken from the Hadoop branch-0.19, where the MapReduce bug is said to have been fixed).
I am verifying it and will need some hours for the examination.

Schubert
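A minimal sketch of lowering the client retry settings discussed in the quoted thread below, assuming they are simply overridden on the job's configuration; the property names hbase.client.pause and hbase.client.retries.number are the ones quoted in the thread, while the values shown and their units are illustrative and should be checked against the HBase version in use.

  import org.apache.hadoop.mapred.JobConf;

  public class ClientRetryConfig {
      public static void main(String[] args) {
          JobConf conf = new JobConf();
          // Fewer retries shorten the worst-case time a failed HTable.commit()
          // can block a map task, at the cost of giving up on the region sooner.
          conf.setInt("hbase.client.retries.number", 5);
          // The thread quotes the pause in seconds; check whether your HBase
          // version expects milliseconds here before copying this value.
          conf.set("hbase.client.pause", "2");
          System.out.println(conf.get("hbase.client.retries.number")
                  + " retries, pause " + conf.get("hbase.client.pause"));
      }
  }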
On Wed, Mar 18, 2009 at 10:14 AM, schubert zhang <[email protected]> wrote:
> Jean Daniel,
>
> 1. If I keep the retries at 10, the task may wait for a long time (142
> seconds). So I want to try to configure:
>   hbase.client.pause = 2
>   hbase.client.retries.number = 5
> Then the waiting time would be 14 seconds.
>
> 2. My DELL 2950 nodes each have 1 CPU with 4 cores, but there is one node
> (a slave) that has 2 CPUs with 2*4=8 cores.
>
> 3. The issues with the Hadoop 0.19.1 MapReduce are:
> (1) Sometimes, some task slots on some tasktrackers are unusable.
>     For example, my total MapTask capacity is 20 and there are 40 maps
>     that need to be processed, but at most only 13 or 16 tasks are running.
>     I think it may be caused by the tasktracker blacklist feature; this
>     issue may occur after some failed tasks.
> (2) I am currently checking the MapReduce logs; since last night, my
>     MapReduce job has been hung up. It has been stuck at 79.9% for about
>     13 hours. The issue looks like this one:
>     https://issues.apache.org/jira/browse/HADOOP-5367
>     After MapReduce has been running for a long time (about 200 jobs have
>     completed), the MapReduce job hangs forever.
>
>     The JobTracker repeatedly logs:
>     2009-03-17 16:29:39,997 INFO org.apache.hadoop.mapred.JobTracker: Serious problem. While updating status, cannot find taskid attempt_200903171247_0387_m_000015_1
>
>     All TaskTrackers may have hung up at the same time, since the logs of
>     each TaskTracker stop at that time.
>
>     And before the hangup, I can also find odds and ends of such
>     JobTracker logs, such as:
>     2009-03-17 16:29:21,767 INFO org.apache.hadoop.mapred.JobTracker: Serious problem. While updating status, cannot find taskid attempt_200903171247_0387_m_000015_1
>
> Schubert
>
>
> On Wed, Mar 18, 2009 at 9:46 AM, Jean-Daniel Cryans <[email protected]> wrote:
>
>> Schubert.
>>
>> Regarding your previous email, I didn't notice anything obvious. You
>> should keep the retries at 10 since you could easily have to wait more
>> than 10 seconds under heavy load.
>>
>> How many CPUs do you have on your Dell 2950?
>>
>> Regarding your last email, HBase 0.19 isn't compatible with Hadoop 0.18.
>> Why do you say it's buggy?
>>
>> Thx,
>>
>> J-D
>>
>> On Tue, Mar 17, 2009 at 9:33 PM, schubert zhang <[email protected]> wrote:
>>
>> > Hi Jean Daniel,
>> > I want to try HBase 0.19.1.
>> > And since the MapReduce of Hadoop 0.19.1 is buggy, I want to use the
>> > stable Hadoop 0.18.3. Can HBase 0.19.x work on Hadoop 0.18.3?
>> >
>> > Schubert
>> >
>> > On Wed, Mar 18, 2009 at 12:26 AM, schubert zhang <[email protected]> wrote:
>> >
>> >> Jean Daniel,
>> >> Thank you very much. I will re-set up my cluster with HBase 0.19.1
>> >> immediately, and I will change my swappiness configuration immediately.
>> >>
>> >> For the RetriesExhaustedException, I double-checked the logs of each node.
>> >>
>> >> For example, one such exception occurred at 09/03/17 16:29:10 for region
>> >> 13975557...@2009-03-17 13:42:36.412:
>> >>
>> >> 09/03/17 16:29:10 INFO mapred.JobClient: map 69% reduce 0%
>> >> 09/03/17 16:29:10 INFO mapred.JobClient: Task Id : attempt_200903171247_0387_m_000009_0, Status : FAILED
>> >> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server Some server for region TABA,13975557...@2009-03-17 13:42:36.412,1237270434007, row '13975559...@2009-03-17 16:25:42.772', but failed after 4 attempts.
>> >> Exceptions:
>> >>
>> >>   at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:942)
>> >>   at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1372)
>> >>   at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1316)
>> >>   at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1296)
>> >>   at net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:21)
>> >>   at net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:11)
>> >>   at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:395)
>> >>   at net.sandmill.control.IndexingJob.map(IndexingJob.java:123)
>> >>   at net.sandmill.control.IndexingJob.map(IndexingJob.java:27)
>> >>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>> >>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
>> >>   at org.apache.hadoop.mapred.Child.main(Child.java:158)
>> >>
>> >> HMaster log:
>> >> 2009-03-17 16:29:16,501 INFO org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_CLOSE: TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 from 10.24.1.20:60020
>> >> 2009-03-17 16:29:18,707 INFO org.apache.hadoop.hbase.master.RegionManager: assigning region TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 to server 10.24.1.14:60020
>> >> 2009-03-17 16:29:19,795 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {regionname: .META.,,1, startKey: <>, server: 10.24.1.12:60020}
>> >> 2009-03-17 16:29:19,857 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 34 row(s) of meta region {regionname: .META.,,1, startKey: <>, server: 10.24.1.12:60020} complete
>> >> 2009-03-17 16:29:19,857 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
>> >> 2009-03-17 16:29:21,723 INFO org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_PROCESS_OPEN: TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 from 10.24.1.14:60020
>> >> 2009-03-17 16:29:21,723 INFO org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_OPEN: TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 from 10.24.1.14:60020
>> >> 2009-03-17 16:29:21,723 INFO org.apache.hadoop.hbase.master.ProcessRegionOpen$1: TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 open on 10.24.1.14:60020
>> >> 2009-03-17 16:29:21,723 INFO org.apache.hadoop.hbase.master.ProcessRegionOpen$1: updating row TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 in region .META.,,1 with startcode 1237274685627 and server 10.24.1.14:60020
>> >>
>> >> One node:
>> >> 2009-03-17 16:28:43,617 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: TABA,13975557...@2009-03-17 13:42:36.412,1237270434007
>> >> 2009-03-17 16:28:43,617 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: TABA,13975557...@2009-03-17 13:42:36.412,1237270434007
>> >> 2009-03-17 16:28:44,604 INFO org.apache.hadoop.hbase.regionserver.HRegion: region TABA,13975557...@2009-03-17 13:42:36.412,1237270434007/168627948 available
>> >> 2009-03-17 16:29:01,682 INFO org.apache.hadoop.hbase.regionserver.HLog: Closed hdfs://nd0-rack0-cloud:9000/hbase/log_10.24.1.14_1237274685627_60020/hlog.dat.1237278496283, entries=100008. New log writer: /hbase/log_10.24.1.14_1237274685627_60020/hlog.dat.1237278541680
>> >> 2009-03-17 16:29:04,920 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on region TABA,13675519...@2009-03-17 13:39:35.123,1237270466558 in 28sec
>> >> 2009-03-17 16:29:04,920 INFO org.apache.hadoop.hbase.regionserver.HRegion: starting compaction on region TABA,13702099...@2009-03-17 14:00:04.791,1237270570714
>> >> 2009-03-17 16:29:04,927 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on region TABA,13702099...@2009-03-17 14:00:04.791,1237270570714 in 0sec
>> >>
>> >> Another node:
>> >> 2009-03-17 16:28:56,462 INFO org.apache.hadoop.hbase.regionserver.HRegion: starting compaction on region CDR,13975557...@2009-03-17 13:42:36.412,1237270434007
>> >> 2009-03-17 16:28:58,304 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: CDR,13775569...@2009-03-17 13:13:57.898,1237278535949
>> >> 2009-03-17 16:28:58,305 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: CDR,13775545...@2009-03-17 13:03:35.830,1237278535949
>> >> 2009-03-17 16:28:58,305 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: CDR,13775569...@2009-03-17 13:13:57.898,1237278535949
>> >> 2009-03-17 16:28:58,343 INFO org.apache.hadoop.hbase.regionserver.HRegion: region CDR,13775569...@2009-03-17 13:13:57.898,1237278535949/1626944070 available
>> >> 2009-03-17 16:28:58,343 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: CDR,13775545...@2009-03-17 13:03:35.830,1237278535949
>> >> 2009-03-17 16:28:58,380 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>> >> 2009-03-17 16:28:58,380 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>> >> 2009-03-17 16:28:58,380 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>> >> 2009-03-17 16:28:58,380 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>> >> 2009-03-17 16:28:58,383 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>> >> 2009-03-17 16:28:58,383 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>> >> 2009-03-17 16:28:58,384 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>> >> 2009-03-17 16:28:58,384 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>> >> 2009-03-17 16:28:58,386 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>> >> 2009-03-17 16:28:58,386 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>> >> 2009-03-17 16:28:58,386 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>> >> 2009-03-17 16:28:58,387 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>> >> 2009-03-17 16:28:58,388 INFO org.apache.hadoop.hbase.regionserver.HRegion: region CDR,13775545...@2009-03-17 13:03:35.830,1237278535949/1656930599 available
>> >> 2009-03-17 16:29:01,325 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_CLOSE: CDR,13975557...@2009-03-17 13:42:36.412,1237270434007: Overloaded
>> >>
>> >> On Tue, Mar 17, 2009 at 8:02 PM, Jean-Daniel Cryans <[email protected]> wrote:
>> >>
>> >>> Schubert,
>> >>>
>> >>> Yeah, latency was always an issue in HBase since reliability was our
>> >>> focus. This is going to be fixed in 0.20 with a new file format and
>> >>> caching of hot cells.
>> >>>
>> >>> Regarding your problem, please enable DEBUG and look in your
>> >>> regionserver logs to see what's happening. Usually, if you get a
>> >>> RetriesExhausted, it's because something is taking too long to get a
>> >>> region reassigned.
>> >>>
>> >>> I also suggest that you upgrade to HBase 0.19.1 candidate 2, which has
>> >>> 2 fixes for massive uploads like yours.
>> >>>
>> >>> I also see that there was some swap; please try setting the swappiness
>> >>> to something like 20. It helped me a lot!
>> >>> http://www.sollers.ca/blog/2008/swappiness/
>> >>>
>> >>> J-D
>> >>>
>> >>> On Tue, Mar 17, 2009 at 3:49 AM, schubert zhang <[email protected]> wrote:
>> >>>
>> >>> > This is the "top" info of a regionserver/datanode/tasktracker node.
>> >>> > We can see that the HRegionServer node is very heavily loaded.
>> >>> >
>> >>> > top - 15:44:17 up 23:46,  3 users,  load average: 1.72, 0.55, 0.23
>> >>> > Tasks: 109 total,   1 running, 108 sleeping,   0 stopped,   0 zombie
>> >>> > Cpu(s): 56.2%us,  3.1%sy,  0.0%ni, 26.6%id, 12.7%wa,  0.2%hi,  1.2%si,  0.0%st
>> >>> > Mem:   4043484k total,  2816028k used,  1227456k free,    18308k buffers
>> >>> > Swap:  2097144k total,    22944k used,  2074200k free,  1601760k cached
>> >>> >
>> >>> >   PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>> >>> > 21797 schubert 24   0 3414m 796m 9848 S  218 20.2   2:55.21 java (HRegionServer)
>> >>> > 20545 schubert 24   0 3461m  98m 9356 S   15  2.5  10:49.64 java (DataNode)
>> >>> > 22677 schubert 24   0 1417m 107m 9224 S    9  2.7   0:08.86 java (MapReduce Child)
>> >>> >
>> >>> > On Tue, Mar 17, 2009 at 3:38 PM, schubert zhang <[email protected]> wrote:
>> >>> >
>> >>> >> Hi all,
>> >>> >>
>> >>> >> I am running a MapReduce job to read files and insert rows into an
>> >>> >> HBase table. It is like an ETL procedure in the database world.
>> >>> >>
>> >>> >> Hadoop 0.19.1, HBase 0.19.0, 5 slaves and 1 master; DELL 2950 servers
>> >>> >> with 4GB memory and 1TB disk on each node.
>> >>> >>
>> >>> >> Issue 1.
>> >>> >>
>> >>> >> Each MapReduce job eats some files, and I want to keep this pipeline
>> >>> >> balanced (I have defined the HBase table with a TTL) since many new
>> >>> >> incoming files need to be eaten. But sometimes many of the following
>> >>> >> RetriesExhaustedExceptions happen and the job is blocked for a long
>> >>> >> time, so the MapReduce job falls behind and cannot catch up with the
>> >>> >> speed of the incoming files.
>> >>> >>
>> >>> >> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server Some server for region TABA,113976305...@2009-03-17 13:11:04.135,1237268430402, row '113976399...@2009-03-17 13:40:12.213', but failed after 10 attempts.
>> >>> >>
>> >>> >> Exceptions:
>> >>> >>
>> >>> >>   at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:942)
>> >>> >>   at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1372)
>> >>> >>   at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1316)
>> >>> >>   at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1296)
>> >>> >>   at net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:21)
>> >>> >>   at net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:11)
>> >>> >>   at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:395)
>> >>> >>   at net.sandmill.control.IndexingJob.map(IndexingJob.java:117)
>> >>> >>   at net.sandmill.control.IndexingJob.map(IndexingJob.java:27)
>> >>> >>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>> >>> >>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
>> >>> >>   at org.apache.hadoop.mapred.Child.main(Child.java:158)
>> >>> >>
>> >>> >> I checked the code and found that the retry policy for this exception
>> >>> >> is like the BSD TCP syn backoff policy:
>> >>> >>
>> >>> >>   /**
>> >>> >>    * This is a retry backoff multiplier table similar to the BSD TCP syn
>> >>> >>    * backoff table, a bit more aggressive than simple exponential backoff.
>> >>> >>    */
>> >>> >>   public static int RETRY_BACKOFF[] = { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32 };
>> >>> >>
>> >>> >> So, 10 retries will take a long time: 71*2 = 142 seconds. So I think the
>> >>> >> following two parameters can be changed, and I reduced them:
>> >>> >>   hbase.client.pause = 1 (second)
>> >>> >>   hbase.client.retries.number = 4
>> >>> >> Now the blocking time should be 5 seconds.
>> >>> >>
>> >>> >> In fact, I have sometimes also met such exceptions in
>> >>> >> getRegionServerWithRetries, getRegionLocationForRowWithRetries, and
>> >>> >> processBatchOfRows.
>> >>> >>
>> >>> >> I want to know what causes the above exception. Region splitting?
>> >>> >> Compaction? Or a region moving to another regionserver?
>> >>> >> This exception happens very frequently in my cluster.
>> >>> >>
>> >>> >> Issue 2: In the same workflow, I found the regionserver nodes are very
>> >>> >> busy, with a very heavy load. The HRegionServer process takes almost
>> >>> >> all of the used CPU.
>> >>> >> In my previous test round, I ran 4 map tasks per node to insert data;
>> >>> >> now I just run 2 map tasks per node. I think the performance of
>> >>> >> HRegionServer is not good enough for some of my applications, because
>> >>> >> besides the insert/input procedure, we also query rows from the same
>> >>> >> table for online applications, which need low latency. In my tests, I
>> >>> >> find the online query (a scan of about 50 rows with a startRow and
>> >>> >> endRow) latency is 2-6 seconds for the first query and about 500 ms for
>> >>> >> a re-query. And when the cluster is busy inserting data into the table,
>> >>> >> the query latency may be as long as 10-30 seconds, and when the table
>> >>> >> is splitting/compacting, maybe longer.
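As a footnote to the retry discussion above, here is a small sketch that reproduces the 142-, 14-, and 5-second estimates from the thread. It assumes the total wait before a RetriesExhaustedException is roughly hbase.client.pause multiplied by the sum of the first N entries of the RETRY_BACKOFF table quoted above; the class and method names are made up for illustration, and the model is inferred from the thread's figures rather than taken from the HBase source.

  public class RetryWaitEstimate {
      // Backoff multiplier table as quoted from the HBase 0.19 client above.
      static final int[] RETRY_BACKOFF = { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32 };

      // Rough worst-case wait before giving up, assuming the client sleeps
      // pause * RETRY_BACKOFF[i] between attempts.
      static long totalWaitSeconds(long pauseSeconds, int retries) {
          long total = 0;
          for (int i = 0; i < retries && i < RETRY_BACKOFF.length; i++) {
              total += pauseSeconds * RETRY_BACKOFF[i];
          }
          return total;
      }

      public static void main(String[] args) {
          System.out.println(totalWaitSeconds(2, 10)); // 142s: pause = 2, the default 10 retries
          System.out.println(totalWaitSeconds(2, 5));  // 14s:  pause = 2, retries = 5
          System.out.println(totalWaitSeconds(1, 4));  // 5s:   pause = 1, retries = 4
      }
  }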
