Nick Dimiduk created HBASE-9675:
-----------------------------------

             Summary: Root-cause analysis is difficult with AsyncProcess
                 Key: HBASE-9675
                 URL: https://issues.apache.org/jira/browse/HBASE-9675
             Project: HBase
          Issue Type: Improvement
          Components: Client
    Affects Versions: 0.95.2
            Reporter: Nick Dimiduk


AsyncProcess swallows the informative exceptions returned from the RS. For 
instance, diagnosing the cause of a failed mapreduce task is pretty opaque 
right now. The client log tells me that my async requests have been retried 
and rejected, but not why:

{noformat}
2013-09-27 22:47:03,391 INFO [main] 
org.apache.hadoop.hbase.client.AsyncProcess: : Waiting for the global number of 
running tasks to be equals or less than 0, tasksSent=6184, tasksDone=6180, 
currentTasksDone=6180, tableName=IntegrationTestLoadAndVerify
// repeat about 30 times
2013-09-27 22:53:13,886 INFO [main] 
org.apache.hadoop.hbase.client.AsyncProcess: : Waiting for the global number of 
running tasks to be equals or less than 0, tasksSent=12140, tasksDone=12135, 
currentTasksDone=12135, tableName=IntegrationTestLoadAndVerify
2013-09-27 22:53:13,890 WARN [htable-pool1-t51] 
org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 4 ops on 
hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT 
resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
 hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
2013-09-27 22:53:14,030 WARN [htable-pool1-t51] 
org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops on 
hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT 
resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
 hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
2013-09-27 22:53:14,179 WARN [htable-pool1-t51] 
org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops on 
hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT 
resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
 hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
2013-09-27 22:53:14,251 WARN [htable-pool1-t51] 
org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops on 
hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT 
resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
 hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
2013-09-27 22:53:14,277 WARN [htable-pool1-t51] 
org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops on 
hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT 
resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
 hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
2013-09-27 22:53:26,862 INFO [main] 
org.apache.hadoop.hbase.client.AsyncProcess: : Waiting for the global number of 
running tasks to be equals or less than 0, tasksSent=12186, tasksDone=12185, 
currentTasksDone=12185, tableName=IntegrationTestLoadAndVerify
2013-09-27 22:53:26,915 INFO [main] 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: 
Closing zookeeper sessionid=0x14160ad6eb404c3
2013-09-27 22:53:26,920 INFO [main] org.apache.zookeeper.ZooKeeper: Session: 
0x14160ad6eb404c3 closed
2013-09-27 22:53:26,920 INFO [main-EventThread] 
org.apache.zookeeper.ClientCnxn: EventThread shut down
2013-09-27 22:53:26,936 ERROR [main] 
org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException 
as:hrt_qa (auth:SIMPLE) 
cause:org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: 
Failed 36 actions: ConnectException: 4 times, FailedServerException: 32 times, 
2013-09-27 22:53:26,937 WARN [main] org.apache.hadoop.mapred.YarnChild: 
Exception running child : 
org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 36 
actions: ConnectException: 4 times, FailedServerException: 32 times, 
        at 
org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:163)
        at 
org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$500(AsyncProcess.java:147)
        at 
org.apache.hadoop.hbase.client.AsyncProcess.getErrors(AsyncProcess.java:711)
        at 
org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:939)
        at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1315)
        at 
org.apache.hadoop.hbase.test.IntegrationTestLoadAndVerify$LoadMapper.map(IntegrationTestLoadAndVerify.java:237)
        at 
org.apache.hadoop.hbase.test.IntegrationTestLoadAndVerify$LoadMapper.map(IntegrationTestLoadAndVerify.java:159)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1483)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
{noformat}
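
For what it's worth, the per-action causes do ride along on the 
RetriesExhaustedWithDetailsException that eventually surfaces, so a caller can 
unpack them by hand. A minimal sketch of that (the class name, table name, and 
dump-to-stderr handling are just for illustration):

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException;
import org.apache.hadoop.hbase.util.Bytes;

public class FlushWithDetails {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "IntegrationTestLoadAndVerify");
    try {
      // ... buffered Puts happen here, as in the LoadMapper ...
      table.flushCommits();
    } catch (RetriesExhaustedWithDetailsException e) {
      // The per-action details are carried on the exception; print each one.
      for (int i = 0; i < e.getNumExceptions(); i++) {
        System.err.println("row=" + Bytes.toStringBinary(e.getRow(i).getRow())
            + " server=" + e.getHostnamePort(i)
            + " cause=" + e.getCause(i));
      }
      // Or dump everything in one go.
      System.err.println(e.getExhaustiveDescription());
      throw e;
    } finally {
      table.close();
    }
  }
}
{code}

That only helps once the retries are already exhausted, though; the 
intermediate attempts stay silent, which is the real gap here.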

Likewise, there's nothing "broken" in the RS logs. The RS is sending back 
RegionTooBusyExceptions, but neither the RS nor the client informs the user of 
this.

We need to add logging somewhere, probably on the client side, maybe on both 
sides, to tell our owner-operator what's going on. The client log doesn't say 
much, and all you have in the RS log is a whole bunch of flush and compaction 
activity. In this case, it would have been nice to see something like 
"RegionTooBusyException: above the memstore limit". Even better, include a 
recommendation for getting around the problem, such as "slow down, tiger, 
that's quite the write throughput" or "maybe you want to increase memstore 
size a bit". Or maybe the right approach is to advise tweaks to the client 
settings (retries, sleep periods, buffer size, &c.)?
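
To make that last point concrete, here is a rough sketch of the knobs I have 
in mind (the class name and values are placeholders, not tuning advice); the 
memstore-related settings are server-side and only noted in comments:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ClientTuningSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();

    // Client-side settings a better error message could point at
    // (values here are placeholders):
    conf.setInt("hbase.client.retries.number", 35);      // attempts before giving up
    conf.setLong("hbase.client.pause", 100L);            // base sleep (ms) between attempts
    conf.setLong("hbase.client.write.buffer", 2097152L); // write buffer size in bytes

    // Server-side settings behind "above memstore limit" (hbase-site.xml on the
    // RS, not something the client can change):
    //   hbase.hregion.memstore.flush.size        - memstore size that triggers a flush
    //   hbase.hregion.memstore.block.multiplier  - multiple of flush.size at which
    //                                              the region blocks new writes
  }
}
{code}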
