[ 
https://issues.apache.org/jira/browse/HBASE-9675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Dimiduk updated HBASE-9675:
--------------------------------

    Priority: Critical  (was: Major)

Bumping to critical because the user is flying blind currently. Feel free to 
adjust if you disagree.
                
> Root-cause analysis is difficult with AsyncProcess
> --------------------------------------------------
>
>                 Key: HBASE-9675
>                 URL: https://issues.apache.org/jira/browse/HBASE-9675
>             Project: HBase
>          Issue Type: Improvement
>          Components: Client
>    Affects Versions: 0.95.2
>            Reporter: Nick Dimiduk
>            Priority: Critical
>
> The AsyncProcess swallows up informative exceptions returned from the RS. For 
> instance, diagnosing the cause of a failed mapreduce task is pretty opaque 
> right now. The client log tells me that my async requests have been retried 
> and rejected, but not why:
> {noformat}
> 2013-09-27 22:47:03,391 INFO [main] 
> org.apache.hadoop.hbase.client.AsyncProcess: : Waiting for the global number 
> of running tasks to be equals or less than 0, tasksSent=6184, tasksDone=6180, 
> currentTasksDone=6180, tableName=IntegrationTestLoadAndVerify
> // repeat about 30 times
> 2013-09-27 22:53:13,886 INFO [main] 
> org.apache.hadoop.hbase.client.AsyncProcess: : Waiting for the global number 
> of running tasks to be equals or less than 0, tasksSent=12140, 
> tasksDone=12135, currentTasksDone=12135, 
> tableName=IntegrationTestLoadAndVerify
> 2013-09-27 22:53:13,890 WARN [htable-pool1-t51] 
> org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 4 ops 
> on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT 
> resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
>  hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
> 2013-09-27 22:53:14,030 WARN [htable-pool1-t51] 
> org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops 
> on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT 
> resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
>  hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
> 2013-09-27 22:53:14,179 WARN [htable-pool1-t51] 
> org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops 
> on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT 
> resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
>  hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
> 2013-09-27 22:53:14,251 WARN [htable-pool1-t51] 
> org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops 
> on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT 
> resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
>  hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
> 2013-09-27 22:53:14,277 WARN [htable-pool1-t51] 
> org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops 
> on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT 
> resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
>  hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
> 2013-09-27 22:53:26,862 INFO [main] 
> org.apache.hadoop.hbase.client.AsyncProcess: : Waiting for the global number 
> of running tasks to be equals or less than 0, tasksSent=12186, 
> tasksDone=12185, currentTasksDone=12185, 
> tableName=IntegrationTestLoadAndVerify
> 2013-09-27 22:53:26,915 INFO [main] 
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: 
> Closing zookeeper sessionid=0x14160ad6eb404c3
> 2013-09-27 22:53:26,920 INFO [main] org.apache.zookeeper.ZooKeeper: Session: 
> 0x14160ad6eb404c3 closed
> 2013-09-27 22:53:26,920 INFO [main-EventThread] 
> org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2013-09-27 22:53:26,936 ERROR [main] 
> org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException 
> as:hrt_qa (auth:SIMPLE) 
> cause:org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: 
> Failed 36 actions: ConnectException: 4 times, FailedServerException: 32 
> times, 
> 2013-09-27 22:53:26,937 WARN [main] org.apache.hadoop.mapred.YarnChild: 
> Exception running child : 
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 
> 36 actions: ConnectException: 4 times, FailedServerException: 32 times, 
>       at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:163)
>       at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$500(AsyncProcess.java:147)
>       at 
> org.apache.hadoop.hbase.client.AsyncProcess.getErrors(AsyncProcess.java:711)
>       at 
> org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:939)
>       at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1315)
>       at 
> org.apache.hadoop.hbase.test.IntegrationTestLoadAndVerify$LoadMapper.map(IntegrationTestLoadAndVerify.java:237)
>       at 
> org.apache.hadoop.hbase.test.IntegrationTestLoadAndVerify$LoadMapper.map(IntegrationTestLoadAndVerify.java:159)
>       at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>       at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
>       at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1483)
>       at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
> {noformat}
> Likewise, there's nothing "broken" in the RS logs. It's sending back 
> RegionTooBusyExceptions but neither RS nor client inform the user of this.
> We need to add logging somewhere, probably on the client side, but maybe 
> both, to tell our owner-operator what's going on. The client log doesn't say 
> much and all you have in the RS log is a whole bunch of flush and compaction 
> activity. In this case, it would have been nice to see something like 
> "RegionTooBusyException: above the memstore limit". Even better, include a 
> recommendation to get around the problem, such as "slow down, tiger, that's 
> quite the write throughput" or "maybe you want to increase memstore size a 
> bit". Or maybe the right approach is to advise tweaks to the client settings 
> (retries, sleep periods, buffer size, &c.)?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to