Atin, thank you for the response. I have indeed investigated the locks on that file, and it is a glusterfs process holding an exclusive read/write lock on the entire file:
lsof /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
COMMAND     PID USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
glusterfs 12776 root    6uW  REG  253,1        6 15730814 /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid

That process was invoked with the following options:

ps -ef | grep 12776
root     12776     1  0 Jun03 ?        00:00:03 /usr/sbin/glusterfs -s localhost --volfile-id rebalance/bigdata2 --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=1 --xlator-option *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db --socket-file /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock --pid-file /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid -l /var/log/glusterfs/bigdata2-rebalance.log

Not sure if this information is helpful, but thanks for your reply. A minimal sketch of the non-blocking pidfile-lock pattern behind this error is included after the quoted thread below.

________________________________________
From: Atin Mukherjee <[email protected]>
Sent: Thursday, June 4, 2015 9:24 AM
To: Branden Timm; [email protected]; Nithya Balachandran; Susant Palai; Shyamsundar Ranganathan
Subject: Re: [Gluster-users] One host won't rebalance

On 06/04/2015 06:30 PM, Branden Timm wrote:
> I'm really hoping somebody can at least point me in the right direction on
> how to diagnose this. This morning, roughly 24 hours after initiating the
> rebalance, one host of three in the cluster still hasn't done anything:
>
>         Node    Rebalanced-files        size    scanned    failures    skipped         status    run time in secs
>    ---------    ----------------    --------    -------    --------    -------    -----------    ----------------
>    localhost                2543      14.2TB      11162           0          0    in progress            60946.00
>    gluster-8                1358       6.7TB       9298           0          0    in progress            60946.00
>    gluster-6                   0      0Bytes          0           0          0    in progress                0.00
>
> The only error showing up in the rebalance log is this:
>
> [2015-06-03 19:59:58.314100] E [MSGID: 100018]
> [glusterfsd.c:1677:glusterfs_pidfile_update] 0-glusterfsd: pidfile
> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> lock failed [Resource temporarily unavailable]

This looks like acquiring a POSIX file lock failed, and it seems like the rebalance is *actually not* running. I would leave it to the DHT folks to comment on it.

~Atin

>
> Any help would be greatly appreciated!
>
> ________________________________
> From: [email protected] <[email protected]>
> on behalf of Branden Timm <[email protected]>
> Sent: Wednesday, June 3, 2015 11:52 AM
> To: [email protected]
> Subject: [Gluster-users] One host won't rebalance
>
> Greetings Gluster Users,
>
> I started a rebalance operation on my distributed volume today (CentOS
> 6.6/GlusterFS 3.6.3), and one of the three hosts comprising the cluster is
> just sitting at 0.00 for 'run time in secs', and shows 0 files scanned,
> failed, or skipped.
>
> I've reviewed the rebalance log for the affected server, and I'm seeing these
> messages:
>
> [2015-06-03 15:34:32.703692] I [MSGID: 100030] [glusterfsd.c:2018:main]
> 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.6.3
> (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/bigdata2
> --xlator-option *dht.use-readdirp=yes --xlator-option
> *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes
> --xlator-option *replicate*.data-self-heal=off --xlator-option
> *replicate*.metadata-self-heal=off --xlator-option
> *replicate*.entry-self-heal=off --xlator-option
> *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on
> --xlator-option *dht.rebalance-cmd=1 --xlator-option
> *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db --socket-file
> /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock
> --pid-file
> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> -l /var/log/glusterfs/bigdata2-rebalance.log)
> [2015-06-03 15:34:32.704217] E [MSGID: 100018]
> [glusterfsd.c:1677:glusterfs_pidfile_update] 0-glusterfsd: pidfile
> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> lock failed [Resource temporarily unavailable]
>
> I initially investigated the first warning, "readv on 127.0.0.1:24007 failed".
> netstat shows that IP/port belonging to a glusterd process. Beyond that I
> wasn't able to tell why there would be a problem.
>
> Next, I checked what was up with the lock file that reported "Resource
> temporarily unavailable". The file is present and contains the PID of a
> running glusterfs process:
>
> root 12776 1 0 10:18 ? 00:00:00 /usr/sbin/glusterfs -s
> localhost --volfile-id rebalance/bigdata2 --xlator-option
> *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes
> --xlator-option *dht.assert-no-child-down=yes --xlator-option
> *replicate*.data-self-heal=off --xlator-option
> *replicate*.metadata-self-heal=off --xlator-option
> *replicate*.entry-self-heal=off --xlator-option
> *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on
> --xlator-option *dht.rebalance-cmd=1 --xlator-option
> *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db --socket-file
> /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock
> --pid-file
> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> -l /var/log/glusterfs/bigdata2-rebalance.log
>
> Finally, one other thing I saw from running 'gluster volume status <volname>
> clients' is that the affected server is the only one of the three that lists
> a 127.0.0.1:<port> client for each of its bricks. I don't know why there
> would be a client coming from loopback on the server, but it seems strange.
> Additionally, it makes me wonder if the fact that I have auth.allow set to a
> single subnet (that doesn't include 127.0.0.1) is causing this problem for
> some reason, or if loopback is implicitly allowed to connect.
>
> Any tips or suggestions would be much appreciated. Thanks!
>
> _______________________________________________
> Gluster-users mailing list
> [email protected]
> http://www.gluster.org/mailman/listinfo/gluster-users
>

--
~Atin

_______________________________________________
Gluster-users mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-users
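For reference, the sketch mentioned above: a minimal C illustration of the non-blocking POSIX pidfile-lock pattern that produces a "pidfile ... lock failed [Resource temporarily unavailable]" message. One process takes a whole-file fcntl() write lock on the pidfile and holds it for its lifetime; a second process attempting the same non-blocking lock gets EAGAIN, which strerror() renders as "Resource temporarily unavailable". This is not the actual glusterfs_pidfile_update() implementation, only an illustration of the assumed mechanism; the pidfile path is the one from this thread, everything else is made up for the example.

/*
 * pidfile_lock_sketch.c -- illustrative only, NOT GlusterFS source.
 * Non-blocking whole-file write lock on a pidfile; a second instance
 * fails with EAGAIN ("Resource temporarily unavailable").
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Real pidfile path from this thread; the surrounding code is hypothetical. */
    const char *pidfile =
        "/var/lib/glusterd/vols/bigdata2/rebalance/"
        "3b5025d4-3230-4914-ad0d-32f78587c4db.pid";

    int fd = open(pidfile, O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Non-blocking request for an exclusive (write) lock on the whole file. */
    struct flock lk = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 0 /* 0 = whole file */ };
    if (fcntl(fd, F_SETLK, &lk) < 0) {
        /* EAGAIN (or EACCES) means another process already holds the lock. */
        fprintf(stderr, "pidfile %s lock failed [%s]\n",
                pidfile, strerror(errno));

        /* Ask the kernel which process holds the conflicting lock. */
        struct flock who = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        if (fcntl(fd, F_GETLK, &who) == 0 && who.l_type != F_UNLCK)
            fprintf(stderr, "lock is held by pid %ld\n", (long) who.l_pid);

        close(fd);
        return 1;
    }

    /* Lock acquired: record our pid; the lock lives as long as this process. */
    char buf[32];
    int n = snprintf(buf, sizeof(buf), "%ld\n", (long) getpid());
    if (ftruncate(fd, 0) == 0 && n > 0)
        (void) write(fd, buf, (size_t) n);

    pause();    /* hold the lock until the process is killed */
    return 0;
}

Run against the same pidfile while the rebalance process (pid 12776 here) is still alive, a second instance prints the failure line and reports the holder's PID via F_GETLK. That holder is also what lsof shows at the top of this message: the 'W' in the 6uW FD entry is lsof's marker for a write lock covering the entire file.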
