[
https://issues.apache.org/jira/browse/HBASE-3379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12998937#comment-12998937
]
stack commented on HBASE-3379:
------------------------------
From a log last night:
{code}
21:06 < jdcryans> jstack looks like
21:06 < jdcryans>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2628)
21:06 < jdcryans>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:2829)
21:06 < jdcryans>     at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:528)
21:06 < jdcryans>     at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:186)
21:06 < jdcryans>     at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:572)
21:07 < jdcryans>     at org.apache.hadoop.hbase.util.FSUtils.recoverFileLease(FSUtils.java:619)
{code}
We need to make recoverFileLease call the new recoverLease API rather than append.
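A minimal sketch of that direction, assuming the Hadoop we run against exposes DistributedFileSystem.recoverLease(Path) (branch-0.20-append and later); the retry loop and sleep interval are illustrative, not the actual patch:
{code}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class LeaseRecoverySketch {
  // Recover the WAL's lease by asking the NameNode directly, instead of
  // opening an appender, so we never build a write pipeline that has to
  // talk to the dead datanode.
  public static void recoverFileLease(final FileSystem fs, final Path p)
      throws IOException, InterruptedException {
    if (!(fs instanceof DistributedFileSystem)) {
      return; // local filesystems have no lease to recover
    }
    DistributedFileSystem dfs = (DistributedFileSystem) fs;
    boolean recovered = dfs.recoverLease(p);
    while (!recovered) {
      Thread.sleep(1000); // illustrative back-off between attempts
      recovered = dfs.recoverLease(p);
    }
  }
}
{code}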
> Log splitting slowed by repeated attempts at connecting to downed datanode
> --------------------------------------------------------------------------
>
> Key: HBASE-3379
> URL: https://issues.apache.org/jira/browse/HBASE-3379
> Project: HBase
> Issue Type: Bug
> Components: wal
> Reporter: stack
> Assignee: stack
> Priority: Blocker
> Fix For: 0.92.0
>
>
> In testing, if I kill the RS and DN on a node, log splitting takes longer because
> we doggedly keep trying to connect to the downed DN to get the WAL blocks. Here's
> the cycle I see:
> {code}
> 2010-12-21 17:34:48,239 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery
> for block blk_900551257176291912_1203821 failed because recovery from
> primary datanode 10.20.20.182:10010 failed 5 times. Pipeline was
> 10.20.20.184:10010, 10.20.20.186:10010, 10.20.20.182:10010. Will retry...
> 2010-12-21 17:34:50,240 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /10.20.20.182:10020. Already tried 0 time(s).
> 2010-12-21 17:34:51,241 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /10.20.20.182:10020. Already tried 1 time(s).
> 2010-12-21 17:34:52,241 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /10.20.20.182:10020. Already tried 2 time(s).
> 2010-12-21 17:34:53,242 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /10.20.20.182:10020. Already tried 3 time(s).
> 2010-12-21 17:34:54,243 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /10.20.20.182:10020. Already tried 4 time(s).
> 2010-12-21 17:34:55,243 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /10.20.20.182:10020. Already tried 5 time(s).
> 2010-12-21 17:34:56,244 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /10.20.20.182:10020. Already tried 6 time(s).
> 2010-12-21 17:34:57,245 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /10.20.20.182:10020. Already tried 7 time(s).
> 2010-12-21 17:34:58,245 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /10.20.20.182:10020. Already tried 8 time(s).
> 2010-12-21 17:34:59,246 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /10.20.20.182:10020. Already tried 9 time(s).
> 2010-12-21 17:34:59,246 WARN org.apache.hadoop.hdfs.DFSClient: Failed
> recovery attempt #5 from primary datanode 10.20.20.182:10010
> java.net.ConnectException: Call to /10.20.20.182:10020 failed on connection
> exception: java.net.ConnectException: Connection refused
> at org.apache.hadoop.ipc.Client.wrapException(Client.java:767)
> at org.apache.hadoop.ipc.Client.call(Client.java:743)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
> at $Proxy8.getProtocolVersion(Unknown Source)
> at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
> at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:346)
> at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:383)
> ...
> {code}
> "because recovery from primary datanode" is done 5 times (hardcoded). Within
> these retries we'll do
> {code}
> this.maxRetries = conf.getInt("ipc.client.connect.max.retries", 10);
> {code}
> We should get the hardcoded 5 attempts fixed, and we should document
> ipc.client.connect.max.retries as an important config, recommending it be
> brought down from its default of 10.
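> A hedged illustration of that recommendation (the value 3 here is just an
> example, not a tested setting; in practice this would go in hbase-site.xml on
> the servers):
> {code}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
>
> // Lower the IPC connect retries so a dead datanode is given up on sooner.
> // The default is 10, so together with the 5 hardcoded recovery attempts we
> // can make on the order of 5 x 10 connect attempts to a DN that is gone.
> Configuration conf = HBaseConfiguration.create();
> conf.setInt("ipc.client.connect.max.retries", 3);  // example value only
> {code}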