[ https://issues.apache.org/jira/browse/HDFS-13174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16460711#comment-16460711 ]
Istvan Fajth commented on HDFS-13174:
-------------------------------------

Attaching a patch for review.

The patch contains some refactoring to make the iteration time configurable. I have added a configuration option for the Balancer to control the maximum iteration time. It seemed reasonable, although it might not need to be exposed; in this initial patch I have exposed it.

Added a test for Balancer to check that the max iteration time is respected. To make the test run in a reasonable timeframe with a reasonable amount of resources, I had to use the deprecated DFSConfigKeys.DFS_CLIENT_SOCKET_TIMEOUT_KEY. If there is a better way to control how often the DN gets back to the client to keep the connection alive, I would be glad to know about it. This was the only way I found to affect that: the newly introduced HdfsClientConfigKeys.DFS_CLIENT_SOCKET_TIMEOUT_KEY is not visible in the test package, and I did not find a way to tune the same in the DN.

Added a test for Balancer: if you set the newly added Dispatcher constructor parameter to a value higher than 0, for example 200L, the test fails because no blocks were moved, as the block moves timed out. This was the case with the previous constant.

I am also updating the Jira description, as I learned a few things about the issue.

> hdfs mover -p /path times out after 20 min
> ------------------------------------------
>
>                 Key: HDFS-13174
>                 URL: https://issues.apache.org/jira/browse/HDFS-13174
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: balancer & mover
>    Affects Versions: 2.8.0, 2.7.4, 3.0.0-alpha2
>            Reporter: Istvan Fajth
>            Assignee: Istvan Fajth
>            Priority: Major
>         Attachments: HDFS-13174.001.patch
>
>
> HDFS-11015 introduced an iteration timeout in the Dispatcher.Source class,
> which is checked while dispatching the moves that the Balancer and the
> Mover perform. This timeout is hardwired to 20 minutes.
> In the Balancer we have iterations, and even if an iteration times out,
> the Balancer keeps running and does another iteration; it fails only if
> no moves happened in a few iterations.
> The Mover, on the other hand, does not have iterations, so if moving a path
> runs for more than 20 minutes, the Mover will stop after 20 minutes with the
> following exception reported to the console (line numbers might differ, as
> this exception came from a CDH5.12.1 installation):
> java.io.IOException: Block move timed out
>     at org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.receiveResponse(Dispatcher.java:382)
>     at org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.dispatch(Dispatcher.java:328)
>     at org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.access$2500(Dispatcher.java:186)
>     at org.apache.hadoop.hdfs.server.balancer.Dispatcher$1.run(Dispatcher.java:956)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
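
For illustration, the configurable per-iteration limit discussed in this comment could be sketched roughly as follows. This is a minimal sketch and not the code from HDFS-13174.001.patch: the class name IterationBudget, its constructor parameters, and the convention that a non-positive limit disables the check are all assumptions made for the example. The clock is injected so the behaviour can be exercised without actually waiting.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical helper (not a Hadoop class): tracks a per-iteration time budget. */
final class IterationBudget {
  // Assumption of this sketch: a value <= 0 means "no limit".
  private final long maxIterationMillis;
  // Injected millisecond clock, so tests can control the passage of time.
  private final LongSupplier clock;
  // Time at which this iteration started, read from the injected clock.
  private final long startMillis;

  IterationBudget(long maxIterationMillis, LongSupplier clock) {
    this.maxIterationMillis = maxIterationMillis;
    this.clock = clock;
    this.startMillis = clock.getAsLong();
  }

  /** Returns true once more than maxIterationMillis has elapsed since construction. */
  boolean isTimeUp() {
    if (maxIterationMillis <= 0) {
      return false; // limit disabled in this sketch
    }
    return clock.getAsLong() - startMillis > maxIterationMillis;
  }

  public static void main(String[] args) {
    AtomicLong fakeNow = new AtomicLong(0L);

    // A 200 ms budget, similar in spirit to the 200L value mentioned above.
    IterationBudget limited = new IterationBudget(200L, fakeNow::get);
    System.out.println(limited.isTimeUp()); // false: no time has elapsed yet

    fakeNow.set(201L);
    System.out.println(limited.isTimeUp()); // true: 201 ms elapsed, budget is 200 ms

    // With the limit disabled, a long-running move is never cut off.
    IterationBudget unlimited = new IterationBudget(0L, fakeNow::get);
    fakeNow.set(10_000_000L);
    System.out.println(unlimited.isTimeUp()); // false
  }
}
```

With a fixed 20-minute value in place of maxIterationMillis, a caller without iterations (as described for the Mover) has no way to continue once isTimeUp() returns true, which is why making the limit a parameter matters here.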