Nick Dimiduk created HBASE-9675:
-----------------------------------
Summary: Root-cause analysis is difficult with AsyncProcess
Key: HBASE-9675
URL: https://issues.apache.org/jira/browse/HBASE-9675
Project: HBase
Issue Type: Improvement
Components: Client
Affects Versions: 0.95.2
Reporter: Nick Dimiduk
The AsyncProcess swallows up informative exceptions returned from the RS. For
instance, diagnosing the cause of a failed mapreduce task is pretty opaque
right now. The client log tells me that my async requests have been retried and
rejected, but not why:
{noformat}
2013-09-27 22:47:03,391 INFO [main]
org.apache.hadoop.hbase.client.AsyncProcess: : Waiting for the global number of
running tasks to be equals or less than 0, tasksSent=6184, tasksDone=6180,
currentTasksDone=6180, tableName=IntegrationTestLoadAndVerify
// repeat about 30 times
2013-09-27 22:53:13,886 INFO [main]
org.apache.hadoop.hbase.client.AsyncProcess: : Waiting for the global number of
running tasks to be equals or less than 0, tasksSent=12140, tasksDone=12135,
currentTasksDone=12135, tableName=IntegrationTestLoadAndVerify
2013-09-27 22:53:13,890 WARN [htable-pool1-t51]
org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 4 ops on
hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT
resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
2013-09-27 22:53:14,030 WARN [htable-pool1-t51]
org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops on
hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT
resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
2013-09-27 22:53:14,179 WARN [htable-pool1-t51]
org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops on
hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT
resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
2013-09-27 22:53:14,251 WARN [htable-pool1-t51]
org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops on
hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT
resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
2013-09-27 22:53:14,277 WARN [htable-pool1-t51]
org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops on
hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT
resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
2013-09-27 22:53:26,862 INFO [main]
org.apache.hadoop.hbase.client.AsyncProcess: : Waiting for the global number of
running tasks to be equals or less than 0, tasksSent=12186, tasksDone=12185,
currentTasksDone=12185, tableName=IntegrationTestLoadAndVerify
2013-09-27 22:53:26,915 INFO [main]
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Closing zookeeper sessionid=0x14160ad6eb404c3
2013-09-27 22:53:26,920 INFO [main] org.apache.zookeeper.ZooKeeper: Session:
0x14160ad6eb404c3 closed
2013-09-27 22:53:26,920 INFO [main-EventThread]
org.apache.zookeeper.ClientCnxn: EventThread shut down
2013-09-27 22:53:26,936 ERROR [main]
org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
as:hrt_qa (auth:SIMPLE)
cause:org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
Failed 36 actions: ConnectException: 4 times, FailedServerException: 32 times,
2013-09-27 22:53:26,937 WARN [main] org.apache.hadoop.mapred.YarnChild:
Exception running child :
org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 36
actions: ConnectException: 4 times, FailedServerException: 32 times,
at
org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:163)
at
org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$500(AsyncProcess.java:147)
at
org.apache.hadoop.hbase.client.AsyncProcess.getErrors(AsyncProcess.java:711)
at
org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:939)
at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1315)
at
org.apache.hadoop.hbase.test.IntegrationTestLoadAndVerify$LoadMapper.map(IntegrationTestLoadAndVerify.java:237)
at
org.apache.hadoop.hbase.test.IntegrationTestLoadAndVerify$LoadMapper.map(IntegrationTestLoadAndVerify.java:159)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1483)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
{noformat}
Likewise, there's nothing "broken" in the RS logs. It's sending back
RegionTooBusyExceptions but neither RS nor client inform the user of this.
We need to add logging somewhere, probably on the client side, but maybe both,
to tell our owner-operator what's going on. The client log doesn't say much and
all you have in the RS log is a whole bunch of flush and compaction activity.
In this case, it would have been nice to see something like
"RegionTooBusyException: above the memstore limit". Even better, include a
recommendation to get around the problem, such as "slow down, tiger, that's
quite the write throughput" or "maybe you want to increase memstore size a
bit". Or maybe the right approach is to advise tweaks to the client settings
(retries, sleep periods, buffer size, &c.)?
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira