Atin, thank you for the response. I have indeed investigated the locks on that file, and it is a glusterfs process holding an exclusive read/write lock on the entire file:
lsof /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
COMMAND     PID USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
glusterfs 12776 root    6uW  REG  253,1        6 15730814 /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid

That process was invoked with the following options:

ps -ef | grep 12776
root     12776     1  0 Jun03 ?        00:00:03 /usr/sbin/glusterfs -s localhost --volfile-id rebalance/bigdata2 --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=1 --xlator-option *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db --socket-file /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock --pid-file /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid -l /var/log/glusterfs/bigdata2-rebalance.log

Not sure if this information is helpful, but thanks for your reply. A minimal sketch of the non-blocking pidfile-lock pattern behind this error is included after the quoted thread below.

________________________________________
From: Atin Mukherjee <[email protected]>
Sent: Thursday, June 4, 2015 9:24 AM
To: Branden Timm; [email protected]; Nithya Balachandran; Susant Palai; Shyamsundar Ranganathan
Subject: Re: [Gluster-users] One host won't rebalance

On 06/04/2015 06:30 PM, Branden Timm wrote:
> I'm really hoping somebody can at least point me in the right direction on
> how to diagnose this. This morning, roughly 24 hours after initiating the
> rebalance, one host of three in the cluster still hasn't done anything:
>
>         Node    Rebalanced-files        size    scanned    failures    skipped         status    run time in secs
>    ---------    ----------------    --------    -------    --------    -------    -----------    ----------------
>    localhost                2543      14.2TB      11162           0          0    in progress            60946.00
>    gluster-8                1358       6.7TB       9298           0          0    in progress            60946.00
>    gluster-6                   0      0Bytes          0           0          0    in progress                0.00
>
> The only error showing up in the rebalance log is this:
>
> [2015-06-03 19:59:58.314100] E [MSGID: 100018]
> [glusterfsd.c:1677:glusterfs_pidfile_update] 0-glusterfsd: pidfile
> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> lock failed [Resource temporarily unavailable]

This looks like acquiring a POSIX file lock failed, and it seems like the rebalance is *actually not* running. I would leave it to the DHT folks to comment on it.

~Atin

>
> Any help would be greatly appreciated!
>
> ________________________________
> From: [email protected] <[email protected]>
> on behalf of Branden Timm <[email protected]>
> Sent: Wednesday, June 3, 2015 11:52 AM
> To: [email protected]
> Subject: [Gluster-users] One host won't rebalance
>
> Greetings Gluster Users,
>
> I started a rebalance operation on my distributed volume today (CentOS
> 6.6/GlusterFS 3.6.3), and one of the three hosts comprising the cluster is
> just sitting at 0.00 for 'run time in secs', and shows 0 files scanned,
> failed, or skipped.
>
> I've reviewed the rebalance log for the affected server, and I'm seeing these
> messages:
>
> [2015-06-03 15:34:32.703692] I [MSGID: 100030] [glusterfsd.c:2018:main]
> 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.6.3
> (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/bigdata2
> --xlator-option *dht.use-readdirp=yes --xlator-option
> *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes
> --xlator-option *replicate*.data-self-heal=off --xlator-option
> *replicate*.metadata-self-heal=off --xlator-option
> *replicate*.entry-self-heal=off --xlator-option
> *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on
> --xlator-option *dht.rebalance-cmd=1 --xlator-option
> *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db --socket-file
> /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock
> --pid-file
> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> -l /var/log/glusterfs/bigdata2-rebalance.log)
> [2015-06-03 15:34:32.704217] E [MSGID: 100018]
> [glusterfsd.c:1677:glusterfs_pidfile_update] 0-glusterfsd: pidfile
> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> lock failed [Resource temporarily unavailable]
>
> I initially investigated the first warning, "readv on 127.0.0.1:24007 failed".
> netstat shows that IP/port belonging to a glusterd process. Beyond that I
> wasn't able to tell why there would be a problem.
>
> Next, I checked what was up with the lock file that reported "Resource
> temporarily unavailable". The file is present and contains the PID of a
> running glusterfs process:
>
> root 12776 1 0 10:18 ? 00:00:00 /usr/sbin/glusterfs -s
> localhost --volfile-id rebalance/bigdata2 --xlator-option
> *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes
> --xlator-option *dht.assert-no-child-down=yes --xlator-option
> *replicate*.data-self-heal=off --xlator-option
> *replicate*.metadata-self-heal=off --xlator-option
> *replicate*.entry-self-heal=off --xlator-option
> *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on
> --xlator-option *dht.rebalance-cmd=1 --xlator-option
> *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db --socket-file
> /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock
> --pid-file
> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> -l /var/log/glusterfs/bigdata2-rebalance.log
>
> Finally, one other thing I saw from running 'gluster volume status <volname>
> clients' is that the affected server is the only one of the three that lists
> a 127.0.0.1:<port> client for each of its bricks. I don't know why there
> would be a client coming from loopback on the server, but it seems strange.
> Additionally, it makes me wonder if the fact that I have auth.allow set to a
> single subnet (that doesn't include 127.0.0.1) is causing this problem for
> some reason, or if loopback is implicitly allowed to connect.
>
> Any tips or suggestions would be much appreciated. Thanks!
>
> _______________________________________________
> Gluster-users mailing list
> [email protected]
> http://www.gluster.org/mailman/listinfo/gluster-users
>

--
~Atin

_______________________________________________
Gluster-users mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-users
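For reference, the sketch mentioned above: a minimal C illustration of the non-blocking POSIX pidfile-lock pattern that produces a "pidfile ... lock failed [Resource temporarily unavailable]" message. One process takes a whole-file fcntl() write lock on the pidfile and holds it for its lifetime; a second process attempting the same non-blocking lock gets EAGAIN, which strerror() renders as "Resource temporarily unavailable". This is not the actual glusterfs_pidfile_update() implementation, only an illustration of the assumed mechanism; the pidfile path is the one from this thread, everything else is made up for the example.

/*
 * pidfile_lock_sketch.c -- illustrative only, NOT GlusterFS source.
 * Non-blocking whole-file write lock on a pidfile; a second instance
 * fails with EAGAIN ("Resource temporarily unavailable").
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Real pidfile path from this thread; the surrounding code is hypothetical. */
    const char *pidfile =
        "/var/lib/glusterd/vols/bigdata2/rebalance/"
        "3b5025d4-3230-4914-ad0d-32f78587c4db.pid";

    int fd = open(pidfile, O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Non-blocking request for an exclusive (write) lock on the whole file. */
    struct flock lk = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 0 /* 0 = whole file */ };
    if (fcntl(fd, F_SETLK, &lk) < 0) {
        /* EAGAIN (or EACCES) means another process already holds the lock. */
        fprintf(stderr, "pidfile %s lock failed [%s]\n",
                pidfile, strerror(errno));

        /* Ask the kernel which process holds the conflicting lock. */
        struct flock who = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        if (fcntl(fd, F_GETLK, &who) == 0 && who.l_type != F_UNLCK)
            fprintf(stderr, "lock is held by pid %ld\n", (long) who.l_pid);

        close(fd);
        return 1;
    }

    /* Lock acquired: record our pid; the lock lives as long as this process. */
    char buf[32];
    int n = snprintf(buf, sizeof(buf), "%ld\n", (long) getpid());
    if (ftruncate(fd, 0) == 0 && n > 0)
        (void) write(fd, buf, (size_t) n);

    pause();    /* hold the lock until the process is killed */
    return 0;
}

Run against the same pidfile while the rebalance process (pid 12776 here) is still alive, a second instance prints the failure line and reports the holder's PID via F_GETLK. That holder is also what lsof shows at the top of this message: the 'W' in the 6uW FD entry is lsof's marker for a write lock covering the entire file.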
