[ 
https://issues.apache.org/jira/browse/HDFS-8897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699513#comment-14699513
 ] 

Rakesh R commented on HDFS-8897:
--------------------------------

My observation about this case: the Balancer sees two nameservice URIs, but 
both point to the same cluster, one with a trailing slash 
({{hdfs://sandbox/}}) and the other without ({{hdfs://sandbox}}). When the 
balancer runs, it establishes the NameNodeConnectors, and each connector 
internally creates the idFilePath {{balancer.id}} to prevent simultaneous 
balancer operations. Since both URIs point to the same cluster, creating 
{{balancer.id}} succeeds for the first connector. When the second connector 
then tries to create {{balancer.id}}, it sees that the idFilePath already 
exists, and this results in the failure. IMHO, we should find out why the same 
cluster appears twice in order to understand this properly, right?
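
To illustrate why the two entries are treated as distinct, here is a minimal, 
self-contained sketch using only {{java.net.URI}}. It is not the Balancer's 
actual code: the class {{NameNodeUriDedup}} and the helpers {{stripPath}} and 
{{dedup}} are hypothetical names introduced for this example. It only shows 
that the two URIs from the log compare as unequal, and one possible way they 
could be collapsed to a single entry.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of Hadoop.
class NameNodeUriDedup {

    // Keep only scheme and authority, so that "hdfs://sandbox/" and
    // "hdfs://sandbox" become the same URI.
    static URI stripPath(URI uri) {
        try {
            return new URI(uri.getScheme(), uri.getAuthority(), null, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Collapse a list of namenode URIs to distinct scheme/authority pairs,
    // preserving the original order.
    static Set<URI> dedup(List<URI> uris) {
        Set<URI> result = new LinkedHashSet<>();
        for (URI u : uris) {
            result.add(stripPath(u));
        }
        return result;
    }

    public static void main(String[] args) {
        List<URI> raw = Arrays.asList(
                URI.create("hdfs://sandbox/"),
                URI.create("hdfs://sandbox"));

        // java.net.URI compares the path component, so a set built from
        // the raw URIs keeps both entries.
        System.out.println("distinct raw URIs:        " + new LinkedHashSet<>(raw).size());

        // After stripping the path, only one entry remains.
        System.out.println("distinct normalized URIs: " + dedup(raw).size());
    }
}
```

With the raw list, a balancer-like loop would open a connector for each entry 
and attempt to create the same {{balancer.id}} twice on one cluster. Whether 
normalizing the URIs is the right fix, or whether the duplicate should be 
removed at its source in the configuration, is still open.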

bq.It was working fine with hdfs 2.6.0.
The validation that prevents simultaneous balancing was modified in 2.7.1; 
that's the reason you are not seeing any problem with version 2.6.0.

> Loadbalancer always exits with : java.io.IOException: Another Balancer is 
> running..  Exiting ...
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-8897
>                 URL: https://issues.apache.org/jira/browse/HDFS-8897
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: balancer & mover
>    Affects Versions: 2.7.1
>         Environment: Centos 6.6
>            Reporter: LINTE
>
> When balancer is launched, it should test if there is already a 
> /system/balancer.id file in HDFS.
> Even when the file doesn't exist, the balancer doesn't want to run: 
> 15/08/14 16:35:12 INFO balancer.Balancer: namenodes  = [hdfs://sandbox/, 
> hdfs://sandbox]
> 15/08/14 16:35:12 INFO balancer.Balancer: parameters = 
> Balancer.Parameters[BalancingPolicy.Node, threshold=10.0, max idle iteration 
> = 5, number of nodes to be excluded = 0, number of nodes to be included = 0]
> Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  
> Bytes Being Moved
> 15/08/14 16:35:14 INFO balancer.KeyManager: Block token params received from 
> NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 15/08/14 16:35:14 INFO block.BlockTokenSecretManager: Setting block keys
> 15/08/14 16:35:14 INFO balancer.KeyManager: Update block keys every 2hrs, 
> 30mins, 0sec
> 15/08/14 16:35:14 INFO block.BlockTokenSecretManager: Setting block keys
> 15/08/14 16:35:14 INFO balancer.KeyManager: Block token params received from 
> NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 15/08/14 16:35:14 INFO block.BlockTokenSecretManager: Setting block keys
> 15/08/14 16:35:14 INFO balancer.KeyManager: Update block keys every 2hrs, 
> 30mins, 0sec
> java.io.IOException: Another Balancer is running..  Exiting ...
> Aug 14, 2015 4:35:14 PM  Balancing took 2.408 seconds
> Looking at the audit log file when trying to run the balancer, the balancer 
> creates /system/balancer.id and then deletes it on exiting ... 
> 2015-08-14 16:37:45,844 INFO FSNamesystem.audit: allowed=true   
> [email protected] (auth:KERBEROS) ip=/x.x.x.x       cmd=getfileinfo 
> src=/system/balancer.id dst=null        perm=null       proto=rpc
> 2015-08-14 16:37:45,900 INFO FSNamesystem.audit: allowed=true   
> [email protected] (auth:KERBEROS) ip=/x.x.x.x       cmd=create      
> src=/system/balancer.id dst=null        perm=hdfs:hadoop:rw-r-----      
> proto=rpc
> 2015-08-14 16:37:45,919 INFO FSNamesystem.audit: allowed=true   
> [email protected] (auth:KERBEROS) ip=/x.x.x.x       cmd=getfileinfo 
> src=/system/balancer.id dst=null        perm=null       proto=rpc
> 2015-08-14 16:37:46,090 INFO FSNamesystem.audit: allowed=true   
> [email protected] (auth:KERBEROS) ip=/x.x.x.x       cmd=getfileinfo 
> src=/system/balancer.id dst=null        perm=null       proto=rpc
> 2015-08-14 16:37:46,112 INFO FSNamesystem.audit: allowed=true   
> [email protected] (auth:KERBEROS) ip=/x.x.x.x       cmd=getfileinfo 
> src=/system/balancer.id dst=null        perm=null       proto=rpc
> 2015-08-14 16:37:46,117 INFO FSNamesystem.audit: allowed=true   
> [email protected] (auth:KERBEROS) ip=/x.x.x.x       cmd=delete      
> src=/system/balancer.id dst=null        perm=null       proto=rpc
> The error seems to be located in 
> org/apache/hadoop/hdfs/server/balancer/NameNodeConnector.java 
> The function checkAndMarkRunning returns null even if /system/balancer.id 
> doesn't exist before entering this function; if it exists, then it is deleted 
> and the balancer exits with the same error.
> ----
>   private OutputStream checkAndMarkRunning() throws IOException {
>     try {
>       if (fs.exists(idPath)) {
>         // try appending to it so that it will fail fast if another
>         // balancer is running.
>         IOUtils.closeStream(fs.append(idPath));
>         fs.delete(idPath, true);
>       }
>       final FSDataOutputStream fsout = fs.create(idPath, false);
>       // mark balancer idPath to be deleted during filesystem closure
>       fs.deleteOnExit(idPath);
>       if (write2IdFile) {
>         fsout.writeBytes(InetAddress.getLocalHost().getHostName());
>         fsout.hflush();
>       }
>       return fsout;
>     } catch(RemoteException e) {
>       if (AlreadyBeingCreatedException.class.getName().equals(e.getClassName())) {
>         return null;
>       } else {
>         throw e;
>       }
>     }
>   }
> ----
> Regards



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
