Jean-Daniel,
Thank you very much. I will re-set up my cluster with HBase 0.19.1 and
change my swappiness configuration immediately.
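
For reference, here is how I plan to do it, assuming a standard
sysctl-based Linux setup:

  # apply immediately (as root)
  sysctl -w vm.swappiness=20

  # make it persistent: add this line to /etc/sysctl.conf
  vm.swappiness = 20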

Regarding the RetriesExhaustedException, I double-checked the logs on
each node.

For example, one such exception occurred at 09/03/17 16:29:10 for region
13975557...@2009-03-17 13:42:36.412:

09/03/17 16:29:10 INFO mapred.JobClient:  map 69% reduce 0%
09/03/17 16:29:10 INFO mapred.JobClient: Task Id :
attempt_200903171247_0387_m_000009_0, Status : FAILED
org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact
region server Some server for region
TABA,13975557...@2009-03-17 13:42:36.412,1237270434007, row
'13975559...@2009-03-17 16:25:42.772', but failed after 4 attempts.
Exceptions:

        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:942)
        at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1372)
        at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1316)
        at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1296)
        at net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:21)
        at net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:11)
        at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:395)
        at net.sandmill.control.IndexingJob.map(IndexingJob.java:123)
        at net.sandmill.control.IndexingJob.map(IndexingJob.java:27)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)

HMaster log:
2009-03-17 16:29:16,501 INFO org.apache.hadoop.hbase.master.ServerManager:
Received MSG_REPORT_CLOSE:
TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 from
10.24.1.20:60020
2009-03-17 16:29:18,707 INFO org.apache.hadoop.hbase.master.RegionManager:
assigning region TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 to
server 10.24.1.14:60020
2009-03-17 16:29:19,795 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scanning meta region {regionname: .META.,,1,
startKey: <>, server: 10.24.1.12:60020}
2009-03-17 16:29:19,857 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scan of 34 row(s) of meta region {regionname:
.META.,,1, startKey: <>, server: 10.24.1.12:60020} complete
2009-03-17 16:29:19,857 INFO org.apache.hadoop.hbase.master.BaseScanner: All
1 .META. region(s) scanned
2009-03-17 16:29:21,723 INFO org.apache.hadoop.hbase.master.ServerManager:
Received MSG_REPORT_PROCESS_OPEN:
TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 from
10.24.1.14:60020
2009-03-17 16:29:21,723 INFO org.apache.hadoop.hbase.master.ServerManager:
Received MSG_REPORT_OPEN:
TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 from
10.24.1.14:60020
2009-03-17 16:29:21,723 INFO
org.apache.hadoop.hbase.master.ProcessRegionOpen$1:
TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 open on
10.24.1.14:60020
2009-03-17 16:29:21,723 INFO
org.apache.hadoop.hbase.master.ProcessRegionOpen$1: updating row
TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 in region .META.,,1
with startcode 1237274685627 and server 10.24.1.14:60020

One node:
2009-03-17 16:28:43,617 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
TABA,13975557...@2009-03-17 13:42:36.412,1237270434007
2009-03-17 16:28:43,617 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN:
TABA,13975557...@2009-03-17 13:42:36.412,1237270434007
2009-03-17 16:28:44,604 INFO org.apache.hadoop.hbase.regionserver.HRegion:
region TABA,13975557...@2009-03-17 13:42:36.412,1237270434007/168627948
available
2009-03-17 16:29:01,682 INFO org.apache.hadoop.hbase.regionserver.HLog:
Closed
hdfs://nd0-rack0-cloud:9000/hbase/log_10.24.1.14_1237274685627_60020/hlog.dat.1237278496283,
entries=100008. New log writer:
/hbase/log_10.24.1.14_1237274685627_60020/hlog.dat.1237278541680
2009-03-17 16:29:04,920 INFO org.apache.hadoop.hbase.regionserver.HRegion:
compaction completed on region
TABA,13675519...@2009-03-17 13:39:35.123,1237270466558 in 28sec
2009-03-17 16:29:04,920 INFO org.apache.hadoop.hbase.regionserver.HRegion:
starting  compaction on region
TABA,13702099...@2009-03-17 14:00:04.791,1237270570714
2009-03-17 16:29:04,927 INFO org.apache.hadoop.hbase.regionserver.HRegion:
compaction completed on region
TABA,13702099...@2009-03-17 14:00:04.791,1237270570714 in 0sec

Another node:
2009-03-17 16:28:56,462 INFO org.apache.hadoop.hbase.regionserver.HRegion:
starting  compaction on region
CDR,13975557...@2009-03-17 13:42:36.412,1237270434007
2009-03-17 16:28:58,304 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
CDR,13775569...@2009-03-17 13:13:57.898,1237278535949
2009-03-17 16:28:58,305 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
CDR,13775545...@2009-03-17 13:03:35.830,1237278535949
2009-03-17 16:28:58,305 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN:
CDR,13775569...@2009-03-17 13:13:57.898,1237278535949
2009-03-17 16:28:58,343 INFO org.apache.hadoop.hbase.regionserver.HRegion:
region CDR,13775569...@2009-03-17 13:13:57.898,1237278535949/1626944070
available
2009-03-17 16:28:58,343 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN:
CDR,13775545...@2009-03-17 13:03:35.830,1237278535949
2009-03-17 16:28:58,380 INFO org.apache.hadoop.io.compress.CodecPool: Got
brand-new decompressor
[... the same CodecPool line repeated 11 more times ...]
2009-03-17 16:28:58,388 INFO org.apache.hadoop.hbase.regionserver.HRegion:
region CDR,13775545...@2009-03-17 13:03:35.830,1237278535949/1656930599
available
2009-03-17 16:29:01,325 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_CLOSE:
CDR,13975557...@2009-03-17 13:42:36.412,1237270434007: Overloaded

On Tue, Mar 17, 2009 at 8:02 PM, Jean-Daniel Cryans <[email protected]> wrote:

> Schubert,
>
> Yeah, latency was always an issue in HBase since reliability was our
> focus. This is going to be fixed in 0.20 with a new file format and
> caching of hot cells.
>
> Regarding your problem, please enable DEBUG and look in your
> regionserver logs to see what's happening. Usually, if you get a
> RetriesExhausted, it's because something is taking too long to get a
> region reassigned.
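>
> For example (a sketch, assuming the stock log4j setup that HBase
> ships with), adding this line to conf/log4j.properties turns on
> DEBUG for all of HBase:
>
>   log4j.logger.org.apache.hadoop.hbase=DEBUG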
>
> I also suggest that you upgrade to HBase 0.19.1 candidate 2, which has
> 2 fixes for massive uploads like yours.
>
> I also see that there was some swap; please try setting the swappiness
> to something like 20. It helped me a lot!
> http://www.sollers.ca/blog/2008/swappiness/
>
> J-D
>
> On Tue, Mar 17, 2009 at 3:49 AM, schubert zhang <[email protected]> wrote:
> > This is the "top" info of a regionserver/datanode/tasktracker node. We
> > can see the HRegionServer node is very heavily loaded.
> >
> > top - 15:44:17 up 23:46,  3 users,  load average: 1.72, 0.55, 0.23
> > Tasks: 109 total,   1 running, 108 sleeping,   0 stopped,   0 zombie
> > Cpu(s): 56.2%us,  3.1%sy,  0.0%ni, 26.6%id, 12.7%wa,  0.2%hi,  1.2%si,
> >  0.0%st
> > Mem:   4043484k total,  2816028k used,  1227456k free,    18308k buffers
> > Swap:  2097144k total,    22944k used,  2074200k free,  1601760k cached
> >
> >  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >
> > 21797 schubert  24   0 3414m 796m 9848 S  218 20.2   2:55.21 java (HRegionServer)
> > 20545 schubert  24   0 3461m  98m 9356 S   15  2.5  10:49.64 java (DataNode)
> > 22677 schubert  24   0 1417m 107m 9224 S    9  2.7   0:08.86 java (MapReduce Child)
> >
> > On Tue, Mar 17, 2009 at 3:38 PM, schubert zhang <[email protected]>
> wrote:
> >
> >> Hi all,
> >>
> >> I am running a MapReduce job to read files and insert rows into an
> >> HBase table. It is like an ETL procedure in the database world.
> >>
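> >> Roughly, the writer does something like this (a minimal sketch
> >> against the 0.19 client API; the column name is illustrative, and
> >> this is not the actual net.sandmill TableRowRecordWriter):
> >>
> >>   import org.apache.hadoop.hbase.HBaseConfiguration;
> >>   import org.apache.hadoop.hbase.client.HTable;
> >>   import org.apache.hadoop.hbase.io.BatchUpdate;
> >>   import org.apache.hadoop.hbase.util.Bytes;
> >>
> >>   HTable table = new HTable(new HBaseConfiguration(), "TABA");
> >>   table.setAutoFlush(false);  // buffer updates on the client side
> >>
> >>   // one update per input record, keyed like 13975557...@<timestamp>
> >>   BatchUpdate update = new BatchUpdate("13975557...@2009-03-17 13:42:36.412");
> >>   update.put("data:raw", Bytes.toBytes("record body"));
> >>   table.commit(update);    // queued in the client write buffer
> >>   table.flushCommits();    // where RetriesExhaustedException surfaces
> >>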
> >> Hadoop 0.19.1, HBase 0.19.0, 5 slaves and 1 master; DELL 2950 servers
> >> with 4GB memory and 1TB disk on each node.
> >>
> >> Issue 1.
> >>
> >> Each run of the MapReduce job eats some files, and I want to keep this
> >> pipeline balanced (I have defined the HBase table with a TTL), since
> >> many new incoming files need to be processed. But sometimes many
> >> RetriesExhaustedExceptions like the following occur and the job is
> >> blocked for a long time. Then the MapReduce job falls behind and cannot
> >> catch up with the rate of incoming files.
> >>
> >> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
> >> contact region server Some server for region
> >> TABA,113976305...@2009-03-17 13:11:04.135,1237268430402, row
> >> '113976399...@2009-03-17 13:40:12.213', but failed after 10 attempts.
> >>
> >> Exceptions:
> >>
> >>       at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:942)
> >>       at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1372)
> >>       at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1316)
> >>       at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1296)
> >>       at net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:21)
> >>       at net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:11)
> >>       at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:395)
> >>       at net.sandmill.control.IndexingJob.map(IndexingJob.java:117)
> >>       at net.sandmill.control.IndexingJob.map(IndexingJob.java:27)
> >>       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> >>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
> >>       at org.apache.hadoop.mapred.Child.main(Child.java:158)
> >>
> >> I checked the code and found that the retry policy for this exception
> >> is like the BSD TCP syn backoff policy:
> >>
> >>   /**
> >>    * This is a retry backoff multiplier table similar to the BSD TCP syn
> >>    * backoff table, a bit more aggressive than simple exponential backoff.
> >>    */
> >>   public static int RETRY_BACKOFF[] = { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32 };
> >>
> >> So 10 retries will take a long time: the multipliers sum to 71, and
> >> with the 2-second default pause that is 71 * 2 = 142 seconds. So I
> >> think the following two parameters can be changed, and I reduced them:
> >> hbase.client.pause = 1 (second)
> >> hbase.client.retries.number = 4
> >> Now the blocking time should be (1 + 1 + 1 + 2) * 1 = 5 seconds.
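> >>
> >> Concretely, that means something like the following in hbase-site.xml
> >> (note that hbase.client.pause is expressed in milliseconds):
> >>
> >>   <property>
> >>     <name>hbase.client.pause</name>
> >>     <value>1000</value>
> >>   </property>
> >>   <property>
> >>     <name>hbase.client.retries.number</name>
> >>     <value>4</value>
> >>   </property>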
> >>
> >> In fact, I have sometimes also met this exception in:
> >> getRegionServerWithRetries,
> >> getRegionLocationForRowWithRetries, and
> >> processBatchOfRows.
> >>
> >> I want to know what causes the above exception. Region splitting?
> >> Compaction? Or a region moving to another regionserver?
> >> This exception happens very frequently in my cluster.
> >>
> >> Issue 2: In the same workflow, I found the regionserver nodes are very
> >> busy, with a very heavy load.
> >> The HRegionServer process takes almost all of the used CPU.
> >> In my previous test round I ran 4 map tasks per node to insert data;
> >> now I run just 2 map tasks per node. I think the performance of
> >> HRegionServer is not good enough for some of my applications, because
> >> besides the insert/input procedure, we also query rows from the same
> >> table for online applications, which need low latency. In my tests, I
> >> find the online query latency (a scan of about 50 rows with a startRow
> >> and endRow) is 2-6 seconds for the first query and about 500 ms for a
> >> repeat query. When the cluster is busy inserting data into the table,
> >> the query latency can be as long as 10-30 seconds, and when the table
> >> is splitting/compacting, maybe longer.
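> >>
> >> For reference, the online query I am measuring is roughly the
> >> following sketch (assuming the 0.19 client API; the column family
> >> name is illustrative):
> >>
> >>   byte[][] columns = { Bytes.toBytes("data:") };
> >>   Scanner scanner = table.getScanner(columns, startRow, stopRow);
> >>   try {
> >>     for (RowResult result : scanner) {
> >>       // about 50 rows come back between startRow and stopRow
> >>     }
> >>   } finally {
> >>     scanner.close();
> >>   }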
> >>
> >>
> >>
> >
>
