[jira] [Commented] (HBASE-9675) Root-cause analysis is difficult with AsyncProcess

Nicolas Liochon (JIRA) Mon, 30 Sep 2013 09:15:33 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-9675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13781973#comment-13781973
 ]


Nicolas Liochon commented on HBASE-9675:
----------------------------------------

The info is supposed to be there:
2013-09-27 22:53:26,936 ERROR [main] 
org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException 
as:hrt_qa (auth:SIMPLE) 
cause:org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: 
Failed 36 actions: ConnectException: 4 times, FailedServerException: 32 times, 

What's strange here is that:
- the FailedServerException means that the server is known as dead: so the 
client doesn't try to connect.
- ConnectException is widely used, but usually it means that we tried to 
connect but the server timeouted or refused the connection.

So it's not at all related to 'RegionTooBusyException'. The error says the 
server is dead.

One of the issues with RetriesExhaustedWithDetailsException is that it gives 
only the status of the last error; i.e; if we retry 10 actions 31 times, the 
message is about the 10 final errors, that can be different from the 31 * 10 
errors you had on the "path to failure" (for example, you can image to have 
something like RegionMoved --> RegionNotOpenYet --> RegionTooBusy --> 
ConnectException --> FailedServerException. This history is not shown. 

We could try to do all the history, and then group by common history, but in 
most cases it could mean printing one line per action, as on 31 retries it's 
very likely that they will differ. But I can give it a try if you're ok with 
the approach.


> Root-cause analysis is difficult with AsyncProcess
> --------------------------------------------------
>
>                 Key: HBASE-9675
>                 URL: https://issues.apache.org/jira/browse/HBASE-9675
>             Project: HBase
>          Issue Type: Improvement
>          Components: Client
>    Affects Versions: 0.95.2
>            Reporter: Nick Dimiduk
>            Priority: Critical
>
> The AsyncProcess swallows up informative exceptions returned from the RS. For 
> instance, diagnosing the cause of a failed mapreduce task is pretty opaque 
> right now. The client log tells me that my async requests have been retried 
> and rejected, but not why:
> {noformat}
> 2013-09-27 22:47:03,391 INFO [main] 
> org.apache.hadoop.hbase.client.AsyncProcess: : Waiting for the global number 
> of running tasks to be equals or less than 0, tasksSent=6184, tasksDone=6180, 
> currentTasksDone=6180, tableName=IntegrationTestLoadAndVerify
> // repeat about 30 times
> 2013-09-27 22:53:13,886 INFO [main] 
> org.apache.hadoop.hbase.client.AsyncProcess: : Waiting for the global number 
> of running tasks to be equals or less than 0, tasksSent=12140, 
> tasksDone=12135, currentTasksDone=12135, 
> tableName=IntegrationTestLoadAndVerify
> 2013-09-27 22:53:13,890 WARN [htable-pool1-t51] 
> org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 4 ops 
> on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT 
> resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
>  hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
> 2013-09-27 22:53:14,030 WARN [htable-pool1-t51] 
> org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops 
> on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT 
> resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
>  hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
> 2013-09-27 22:53:14,179 WARN [htable-pool1-t51] 
> org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops 
> on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT 
> resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
>  hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
> 2013-09-27 22:53:14,251 WARN [htable-pool1-t51] 
> org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops 
> on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT 
> resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
>  hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
> 2013-09-27 22:53:14,277 WARN [htable-pool1-t51] 
> org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops 
> on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT 
> resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
>  hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
> 2013-09-27 22:53:26,862 INFO [main] 
> org.apache.hadoop.hbase.client.AsyncProcess: : Waiting for the global number 
> of running tasks to be equals or less than 0, tasksSent=12186, 
> tasksDone=12185, currentTasksDone=12185, 
> tableName=IntegrationTestLoadAndVerify
> 2013-09-27 22:53:26,915 INFO [main] 
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: 
> Closing zookeeper sessionid=0x14160ad6eb404c3
> 2013-09-27 22:53:26,920 INFO [main] org.apache.zookeeper.ZooKeeper: Session: 
> 0x14160ad6eb404c3 closed
> 2013-09-27 22:53:26,920 INFO [main-EventThread] 
> org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2013-09-27 22:53:26,936 ERROR [main] 
> org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException 
> as:hrt_qa (auth:SIMPLE) 
> cause:org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: 
> Failed 36 actions: ConnectException: 4 times, FailedServerException: 32 
> times, 
> 2013-09-27 22:53:26,937 WARN [main] org.apache.hadoop.mapred.YarnChild: 
> Exception running child : 
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 
> 36 actions: ConnectException: 4 times, FailedServerException: 32 times, 
>       at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:163)
>       at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$500(AsyncProcess.java:147)
>       at 
> org.apache.hadoop.hbase.client.AsyncProcess.getErrors(AsyncProcess.java:711)
>       at 
> org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:939)
>       at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1315)
>       at 
> org.apache.hadoop.hbase.test.IntegrationTestLoadAndVerify$LoadMapper.map(IntegrationTestLoadAndVerify.java:237)
>       at 
> org.apache.hadoop.hbase.test.IntegrationTestLoadAndVerify$LoadMapper.map(IntegrationTestLoadAndVerify.java:159)
>       at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>       at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
>       at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1483)
>       at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
> {noformat}
> Likewise, there's nothing "broken" in the RS logs. It's sending back 
> RegionTooBusyExceptions but neither RS nor client inform the user of this.
> We need to add logging somewhere, probably on the client side, but maybe 
> both, to tell our owner-operator what's going on. The client log doesn't say 
> much and all you have in the RS log is a whole bunch of flush and compaction 
> activity. In this case, it would have been nice to see something like 
> "RegionTooBusyException: above the memstore limit". Even better, include a 
> recommendation to get around the problem, such as "slow down, tiger, that's 
> quite the write throughput" or "maybe you want to increase memstore size a 
> bit". Or maybe the right approach is to advise tweaks to the client settings 
> (retries, sleep periods, buffer size, &c.)?



--
This message was sent by Atlassian JIRA
(v6.1#6144)

[jira] [Commented] (HBASE-9675) Root-cause analysis is difficult with AsyncProcess

Reply via email to