[
https://issues.apache.org/jira/browse/HBASE-9675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nick Dimiduk updated HBASE-9675:
--------------------------------
Priority: Critical (was: Major)
Bumping to critical because the user is flying blind currently. Feel free to
adjust if you disagree.
> Root-cause analysis is difficult with AsyncProcess
> --------------------------------------------------
>
> Key: HBASE-9675
> URL: https://issues.apache.org/jira/browse/HBASE-9675
> Project: HBase
> Issue Type: Improvement
> Components: Client
> Affects Versions: 0.95.2
> Reporter: Nick Dimiduk
> Priority: Critical
>
> The AsyncProcess swallows up informative exceptions returned from the RS. For
> instance, diagnosing the cause of a failed mapreduce task is pretty opaque
> right now. The client log tells me that my async requests have been retried
> and rejected, but not why:
> {noformat}
> 2013-09-27 22:47:03,391 INFO [main]
> org.apache.hadoop.hbase.client.AsyncProcess: : Waiting for the global number
> of running tasks to be equals or less than 0, tasksSent=6184, tasksDone=6180,
> currentTasksDone=6180, tableName=IntegrationTestLoadAndVerify
> // repeat about 30 times
> 2013-09-27 22:53:13,886 INFO [main]
> org.apache.hadoop.hbase.client.AsyncProcess: : Waiting for the global number
> of running tasks to be equals or less than 0, tasksSent=12140,
> tasksDone=12135, currentTasksDone=12135,
> tableName=IntegrationTestLoadAndVerify
> 2013-09-27 22:53:13,890 WARN [htable-pool1-t51]
> org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 4 ops
> on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT
> resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
> hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
> 2013-09-27 22:53:14,030 WARN [htable-pool1-t51]
> org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops
> on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT
> resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
> hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
> 2013-09-27 22:53:14,179 WARN [htable-pool1-t51]
> org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops
> on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT
> resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
> hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
> 2013-09-27 22:53:14,251 WARN [htable-pool1-t51]
> org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops
> on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT
> resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
> hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
> 2013-09-27 22:53:14,277 WARN [htable-pool1-t51]
> org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops
> on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT
> resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
> hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
> 2013-09-27 22:53:26,862 INFO [main]
> org.apache.hadoop.hbase.client.AsyncProcess: : Waiting for the global number
> of running tasks to be equals or less than 0, tasksSent=12186,
> tasksDone=12185, currentTasksDone=12185,
> tableName=IntegrationTestLoadAndVerify
> 2013-09-27 22:53:26,915 INFO [main]
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
> Closing zookeeper sessionid=0x14160ad6eb404c3
> 2013-09-27 22:53:26,920 INFO [main] org.apache.zookeeper.ZooKeeper: Session:
> 0x14160ad6eb404c3 closed
> 2013-09-27 22:53:26,920 INFO [main-EventThread]
> org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2013-09-27 22:53:26,936 ERROR [main]
> org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
> as:hrt_qa (auth:SIMPLE)
> cause:org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
> Failed 36 actions: ConnectException: 4 times, FailedServerException: 32
> times,
> 2013-09-27 22:53:26,937 WARN [main] org.apache.hadoop.mapred.YarnChild:
> Exception running child :
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed
> 36 actions: ConnectException: 4 times, FailedServerException: 32 times,
> at
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:163)
> at
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$500(AsyncProcess.java:147)
> at
> org.apache.hadoop.hbase.client.AsyncProcess.getErrors(AsyncProcess.java:711)
> at
> org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:939)
> at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1315)
> at
> org.apache.hadoop.hbase.test.IntegrationTestLoadAndVerify$LoadMapper.map(IntegrationTestLoadAndVerify.java:237)
> at
> org.apache.hadoop.hbase.test.IntegrationTestLoadAndVerify$LoadMapper.map(IntegrationTestLoadAndVerify.java:159)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1483)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
> {noformat}
> Likewise, there's nothing "broken" in the RS logs. It's sending back
> RegionTooBusyExceptions but neither RS nor client inform the user of this.
> We need to add logging somewhere, probably on the client side, but maybe
> both, to tell our owner-operator what's going on. The client log doesn't say
> much and all you have in the RS log is a whole bunch of flush and compaction
> activity. In this case, it would have been nice to see something like
> "RegionTooBusyException: above the memstore limit". Even better, include a
> recommendation to get around the problem, such as "slow down, tiger, that's
> quite the write throughput" or "maybe you want to increase memstore size a
> bit". Or maybe the right approach is to advise tweaks to the client settings
> (retries, sleep periods, buffer size, &c.)?
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira