I'm really hoping somebody can at least point me in the right direction on how
to diagnose this. This morning, roughly 24 hours after initiating the
rebalance, one of the three hosts in the cluster still hasn't done anything:


 Node        Rebalanced-files           size        scanned       failures        skipped            status   run time in secs
 ---------        -----------    -----------    -----------    -----------    -----------      ------------     --------------
 localhost               2543         14.2TB          11162              0              0       in progress           60946.00
 gluster-8               1358          6.7TB           9298              0              0       in progress           60946.00
 gluster-6                  0         0Bytes              0              0              0       in progress               0.00
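
For reference, that table is the output of the usual status query; the volume
name below is taken from the log paths further down:

gluster volume rebalance bigdata2 status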


The only error showing up in the rebalance log is this:


[2015-06-03 19:59:58.314100] E [MSGID: 100018] [glusterfsd.c:1677:glusterfs_pidfile_update] 0-glusterfsd: pidfile /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid lock failed [Resource temporarily unavailable]
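
In case it helps anyone reproduce the diagnosis, something like the following
should show who actually holds that lock (the pidfile path is copied from the
log entry above; the ps pattern is just a guess at matching the rebalance
process):

PIDFILE=/var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
lsof "$PIDFILE"                              # which processes have the pidfile open
grep "$(stat -c %i "$PIDFILE")" /proc/locks  # POSIX locks held on the pidfile's inode
ps -ef | grep '[r]ebalance/bigdata2'         # any duplicate rebalance daemons?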


Any help would be greatly appreciated!



________________________________
From: [email protected] <[email protected]> on 
behalf of Branden Timm <[email protected]>
Sent: Wednesday, June 3, 2015 11:52 AM
To: [email protected]
Subject: [Gluster-users] One host won't rebalance


Greetings Gluster Users,

I started a rebalance operation on my distributed volume today (CentOS
6.6/GlusterFS 3.6.3), and one of the three hosts in the cluster is just
sitting at 0.00 for 'run time in secs', showing 0 files scanned, failed, or
skipped.
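
For what it's worth, the rebalance was kicked off with the standard command,
i.e. something like:

gluster volume rebalance bigdata2 start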


I've reviewed the rebalance log for the affected server, and I'm seeing these 
messages:


[2015-06-03 15:34:32.703692] I [MSGID: 100030] [glusterfsd.c:2018:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.6.3 (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/bigdata2 --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=1 --xlator-option *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db --socket-file /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock --pid-file /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid -l /var/log/glusterfs/bigdata2-rebalance.log)
[2015-06-03 15:34:32.704217] E [MSGID: 100018] [glusterfsd.c:1677:glusterfs_pidfile_update] 0-glusterfsd: pidfile /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid lock failed [Resource temporarily unavailable]


I initially investigated the first warning, 'readv on 127.0.0.1:24007
failed'. netstat shows that IP/port belongs to a glusterd process, but beyond
that I wasn't able to tell why there would be a problem.
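
The check was just along these lines (24007 is glusterd's management port, so
a glusterd listener there looked normal to me):

netstat -tlnp | grep ':24007'    # the listener on 24007 (glusterd)
netstat -anp  | grep ':24007'    # plus the connections established to it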


Next, I checked out the lock file that reported 'Resource temporarily
unavailable'. The file is present and contains the PID of a running glusterfs
(rebalance) process:


root     12776     1  0 10:18 ?        00:00:00 /usr/sbin/glusterfs -s localhost --volfile-id rebalance/bigdata2 --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=1 --xlator-option *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db --socket-file /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock --pid-file /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid -l /var/log/glusterfs/bigdata2-rebalance.log
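
That was verified with something like this; it just confirms the PID recorded
in the pidfile maps to the process shown above (pidfile path copied from the
log):

PIDFILE=/var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
cat "$PIDFILE"              # the PID recorded in the pidfile
ps -fp "$(cat "$PIDFILE")"  # confirm that PID is alive and what it is running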


Finally, one other thing I saw from running 'gluster volume status <volname>
clients' is that the affected server is the only one of the three that lists a
127.0.0.1:<port> client for each of its bricks. I don't know why a client
would be connecting from loopback on that server, but it seems strange. It
also makes me wonder whether having auth.allow set to a single subnet (one
that doesn't include 127.0.0.1) is causing this problem for some reason, or
whether loopback is implicitly allowed to connect.
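
For completeness, the commands behind those observations were along these
lines (volume name filled in from above):

gluster volume status bigdata2 clients          # per-brick client list, including the 127.0.0.1 entries
gluster volume info bigdata2 | grep auth.allow  # the auth.allow subnet currently set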


Any tips or suggestions would be much appreciated. Thanks!

_______________________________________________
Gluster-users mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-users
