I should add that there are additional errors in the brick logs. I've
posted them to a gist at 
https://gist.github.com/brandentimm/576432ddabd70184d257


________________________________
From: [email protected] <[email protected]> on 
behalf of Branden Timm <[email protected]>
Sent: Thursday, June 4, 2015 1:31 PM
To: Atin Mukherjee
Cc: [email protected]
Subject: Re: [Gluster-users] One host won't rebalance


I have stopped and restarted the rebalance several times, with no difference in 
results. I have restarted all gluster services several times, and completely 
rebooted the affected system.
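
For reference, the stop/restart cycle each time was roughly:

gluster volume rebalance bigdata2 stop
service glusterd restart      # on the affected host (CentOS 6.6, so SysV init)
gluster volume rebalance bigdata2 start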


Yes, gluster volume status does show an active rebalance task for volume 
bigdata2.
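
Specifically, I'm checking the Task Status section of the output:

gluster volume status bigdata2 | grep -A 4 'Task Status'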


I just noticed something else in the brick logs. I am seeing tons of messages 
similar to these two:


[2015-06-04 16:22:26.179797] E [posix-helpers.c:938:posix_handle_pair] 
0-bigdata2-posix: /<redacted path>: key:glusterfs-internal-fop flags: 1 
length:4 error:Operation not supported
[2015-06-04 16:22:26.179874] E [posix.c:2325:posix_create] 0-bigdata2-posix: 
setting xattrs on /<path redacted> failed (Operation not supported)


Note that both messages refer to the same file. I have confirmed that xattr 
support is enabled on the underlying filesystem. Additionally, these messages 
are NOT appearing on the other cluster members, which seem to be unaffected by 
whatever is going on.
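
For reference, xattr support was confirmed with a quick manual test along 
these lines (brick path is an example; run as root, since trusted.* xattrs 
are root-only):

touch /path/to/brick/xattr-test
setfattr -n trusted.test -v 1 /path/to/brick/xattr-test
getfattr -n trusted.test /path/to/brick/xattr-test
rm /path/to/brick/xattr-test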


I found this bug, which seems similar, but it was supposedly fixed in the 
3.6.1 release: https://bugzilla.redhat.com/show_bug.cgi?id=1098794


Thanks again for your help.


________________________________
From: Atin Mukherjee <[email protected]>
Sent: Thursday, June 4, 2015 1:25 PM
To: Branden Timm
Cc: Shyamsundar Ranganathan; Susant Palai; [email protected]; Atin 
Mukherjee; Nithya Balachandran
Subject: Re: [Gluster-users] One host won't rebalance


Sent from Samsung Galaxy S4
On 4 Jun 2015 22:18, "Branden Timm" <[email protected]> wrote:
>
> Atin, thank you for the response.  Indeed I have investigated the locks on 
> that file, and it is a glusterfs process with an exclusive read/write lock on 
> the entire file:
>
> lsof 
> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> COMMAND     PID USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
> glusterfs 12776 root    6uW  REG  253,1        6 15730814 
> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
>
> That process was invoked with the following options:
>
> ps -ef | grep 12776
> root     12776     1  0 Jun03 ?        00:00:03 /usr/sbin/glusterfs -s 
> localhost --volfile-id rebalance/bigdata2 --xlator-option 
> *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes 
> --xlator-option *dht.assert-no-child-down=yes --xlator-option 
> *replicate*.data-self-heal=off --xlator-option 
> *replicate*.metadata-self-heal=off --xlator-option 
> *replicate*.entry-self-heal=off --xlator-option 
> *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on 
> --xlator-option *dht.rebalance-cmd=1 --xlator-option 
> *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db --socket-file 
> /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock 
> --pid-file 
> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
>  -l /var/log/glusterfs/bigdata2-rebalance.log
This means there is already a rebalance process alive. Could you help me with 
the following:
1. What does bigdata2-rebalance.log say? Do you see a shutdown message 
anywhere in it?
2. Does the output of gluster volume status show bigdata2 as rebalancing?

As a workaround, can you kill this process and start a fresh rebalance?
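Something like this, using the PID from your lsof output:

kill 12776
gluster volume rebalance bigdata2 start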
>
> Not sure if this information is helpful, but thanks for your reply.
>
> ________________________________________
> From: Atin Mukherjee <[email protected]>
> Sent: Thursday, June 4, 2015 9:24 AM
> To: Branden Timm; [email protected]; Nithya Balachandran; Susant 
> Palai; Shyamsundar Ranganathan
> Subject: Re: [Gluster-users] One host won't rebalance
>
> On 06/04/2015 06:30 PM, Branden Timm wrote:
> > I'm really hoping somebody can at least point me in the right direction on 
> > how to diagnose this. This morning, roughly 24 hours after initiating the 
> > rebalance, one host of three in the cluster still hasn't done anything:
> >
> >
> >      Node    Rebalanced-files     size  scanned  failures  skipped       status  run time in secs
> >  ---------   ----------------  -------  -------  --------  -------  -----------  ----------------
> >  localhost               2543   14.2TB    11162         0        0  in progress          60946.00
> >  gluster-8               1358    6.7TB     9298         0        0  in progress          60946.00
> >  gluster-6                  0   0Bytes        0         0        0  in progress              0.00
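> > (That table is the output of 'gluster volume rebalance bigdata2 status'.)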
> >
> >
> > The only error showing up in the rebalance log is this:
> >
> >
> > [2015-06-03 19:59:58.314100] E [MSGID: 100018] 
> > [glusterfsd.c:1677:glusterfs_pidfile_update] 0-glusterfsd: pidfile 
> > /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> >  lock failed [Resource temporarily unavailable]
> This looks like acquiring the POSIX file lock failed, which suggests that a 
> rebalance is *actually not* running. I would leave it to the dht folks to 
> comment on it.
>
> ~Atin
> >
> >
> > Any help would be greatly appreciated!
> >
> >
> >
> > ________________________________
> > From: [email protected] <[email protected]> on 
> > behalf of Branden Timm <[email protected]>
> > Sent: Wednesday, June 3, 2015 11:52 AM
> > To: [email protected]<mailto:[email protected]>
> > Subject: [Gluster-users] One host won't rebalance
> >
> >
> > Greetings Gluster Users,
> >
> > I started a rebalance operation on my distributed volume today (CentOS 
> > 6.6/GlusterFS 3.6.3), and one of the three hosts comprising the cluster is 
> > just sitting at 0.00 for 'run time in secs', and shows 0 files scanned, 
> > failed, or skipped.
> >
> >
> > I've reviewed the rebalance log for the affected server, and I'm seeing 
> > these messages:
> >
> >
> > [2015-06-03 15:34:32.703692] I [MSGID: 100030] [glusterfsd.c:2018:main] 
> > 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.6.3 
> > (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/bigdata2 
> > --xlator-option *dht.use-readdirp=yes --xlator-option 
> > *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes 
> > --xlator-option *replicate*.data-self-heal=off --xlator-option 
> > *replicate*.metadata-self-heal=off --xlator-option 
> > *replicate*.entry-self-heal=off --xlator-option 
> > *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on 
> > --xlator-option *dht.rebalance-cmd=1 --xlator-option 
> > *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db --socket-file 
> > /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock
> >  --pid-file 
> > /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> >  -l /var/log/glusterfs/bigdata2-rebalance.log)
> > [2015-06-03 15:34:32.704217] E [MSGID: 100018] 
> > [glusterfsd.c:1677:glusterfs_pidfile_update] 0-glusterfsd: pidfile 
> > /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> >  lock failed [Resource temporarily unavailable]
> >
> >
> > I initially investigated the first warning, readv on 127.0.0.1:24007 
> > failed. netstat shows that IP/port belongs to a glusterd process. Beyond 
> > that, I wasn't able to tell why there would be a problem.
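> > 
> > (For reference, the check was along the lines of 'netstat -tlnp | grep 
> > 24007'; 24007 is glusterd's management port.)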
> >
> >
> > Next, I checked out the lock file that reported "Resource temporarily 
> > unavailable". The file is present and contains the PID of the running 
> > glusterfs process:
> >
> >
> > root     12776     1  0 10:18 ?        00:00:00 /usr/sbin/glusterfs -s 
> > localhost --volfile-id rebalance/bigdata2 --xlator-option 
> > *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes 
> > --xlator-option *dht.assert-no-child-down=yes --xlator-option 
> > *replicate*.data-self-heal=off --xlator-option 
> > *replicate*.metadata-self-heal=off --xlator-option 
> > *replicate*.entry-self-heal=off --xlator-option 
> > *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on 
> > --xlator-option *dht.rebalance-cmd=1 --xlator-option 
> > *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db --socket-file 
> > /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock
> >  --pid-file 
> > /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> >  -l /var/log/glusterfs/bigdata2-rebalance.log
> >
> >
> > Finally, one other thing I saw from running 'gluster volume status 
> > <volname> clients' is that the affected server is the only one of the 
> > three that lists a 127.0.0.1:<port> client for each of its bricks. I 
> > don't know why there would be a client coming from loopback on the 
> > server, but it seems strange. Additionally, it makes me wonder whether 
> > the fact that I have auth.allow set to a single subnet (one that doesn't 
> > include 127.0.0.1) is causing this problem for some reason, or if 
> > loopback is implicitly allowed to connect.
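> > 
> > (For reference, the auth.allow value shows up under 'Options 
> > Reconfigured' in the output of 'gluster volume info bigdata2'.)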
> >
> >
> > Any tips or suggestions would be much appreciated. Thanks!
> >
> >
> >
> >
> > _______________________________________________
> > Gluster-users mailing list
> > [email protected]<mailto:[email protected]>
> > http://www.gluster.org/mailman/listinfo/gluster-users
> >
>
> --
> ~Atin
> _______________________________________________
> Gluster-users mailing list
> [email protected]<mailto:[email protected]>
> http://www.gluster.org/mailman/listinfo/gluster-users
_______________________________________________
Gluster-users mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-users
