Greetings Gluster Users,

I started a rebalance operation on my distributed volume today (CentOS 
6.6/GlusterFS 3.6.3), and one of the three hosts comprising the cluster is just 
sitting at 0.00 for 'run time in secs', and shows 0 files scanned, failed, or 
skipped.
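
For reference, those numbers are from the per-node rebalance status output,
i.e. roughly:

# per-node rebalance progress: run time in secs, files scanned/failed/skipped
gluster volume rebalance bigdata2 status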


I've reviewed the rebalance log for the affected server, and I'm seeing these 
messages:


[2015-06-03 15:34:32.703692] I [MSGID: 100030] [glusterfsd.c:2018:main] 
0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.6.3 (args: 
/usr/sbin/glusterfs -s localhost --volfile-id rebalance/bigdata2 
--xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes 
--xlator-option *dht.assert-no-child-down=yes --xlator-option 
*replicate*.data-self-heal=off --xlator-option 
*replicate*.metadata-self-heal=off --xlator-option 
*replicate*.entry-self-heal=off --xlator-option 
*replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on 
--xlator-option *dht.rebalance-cmd=1 --xlator-option 
*dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db --socket-file 
/var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock 
--pid-file 
/var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
 -l /var/log/glusterfs/bigdata2-rebalance.log)
[2015-06-03 15:34:32.704217] E [MSGID: 100018] 
[glusterfsd.c:1677:glusterfs_pidfile_update] 0-glusterfsd: pidfile 
/var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
 lock failed [Resource temporarily unavailable]


I initially investigated the first warning in the log, a failed readv on 
127.0.0.1:24007. netstat shows that IP/port belongs to a glusterd process, but 
beyond that I couldn't tell why there would be a problem.
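
For anyone who wants to reproduce it, the check amounts to something like:

# see which process is listening on 24007 (glusterd's management port)
netstat -tlnp | grep ':24007'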


Next, I checked out the pid file whose lock attempt reported 'Resource 
temporarily unavailable'. The file is present and contains the PID of a running 
glusterfs rebalance process:


root     12776     1  0 10:18 ?        00:00:00 /usr/sbin/glusterfs -s 
localhost --volfile-id rebalance/bigdata2 --xlator-option *dht.use-readdirp=yes 
--xlator-option *dht.lookup-unhashed=yes --xlator-option 
*dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off 
--xlator-option *replicate*.metadata-self-heal=off --xlator-option 
*replicate*.entry-self-heal=off --xlator-option 
*replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on 
--xlator-option *dht.rebalance-cmd=1 --xlator-option 
*dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db --socket-file 
/var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock 
--pid-file 
/var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
 -l /var/log/glusterfs/bigdata2-rebalance.log
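
For completeness, the cross-check boiled down to something like this:

# pid file named in the lock error; confirm the pid it records is still alive
PIDFILE=/var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
cat "$PIDFILE"
ps -fp "$(cat "$PIDFILE")"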


Finally, one other thing I noticed from running 'gluster volume status <volname> 
clients' is that the affected server is the only one of the three that lists a 
127.0.0.1:<port> client for each of its bricks. I don't know why a client would 
be connecting from loopback on that server, but it seems strange. It also makes 
me wonder whether having auth.allow set to a single subnet (one that doesn't 
include 127.0.0.1) is causing this problem for some reason, or whether loopback 
is implicitly allowed to connect.
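
In case it helps, the auth.allow value I'm referring to is the one listed
under the volume's reconfigured options, e.g. something like:

# auth.allow shows up under 'Options Reconfigured' if it has been set
gluster volume info bigdata2 | grep -i 'auth.allow'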


Any tips or suggestions would be much appreciated. Thanks!

