On 06/05/2015 12:05 AM, Branden Timm wrote:
> I should add that there are additional errors as well in the brick logs.
> I've posted them to a gist at
> https://gist.github.com/brandentimm/576432ddabd70184d257
As I mentioned earlier, the DHT team can answer all your questions on this
failure.
~Atin
>
> ________________________________
> From: [email protected] <[email protected]>
> on behalf of Branden Timm <[email protected]>
> Sent: Thursday, June 4, 2015 1:31 PM
> To: Atin Mukherjee
> Cc: [email protected]
> Subject: Re: [Gluster-users] One host won't rebalance
>
> I have stopped and restarted the rebalance several times, with no difference
> in results. I have restarted all gluster services several times, and
> completely rebooted the affected system.
>
> Yes, gluster volume status does show an active rebalance task for volume
> bigdata2.
>
> I just noticed something else in the brick logs. I am seeing tons of
> messages similar to these two:
>
> [2015-06-04 16:22:26.179797] E [posix-helpers.c:938:posix_handle_pair]
> 0-bigdata2-posix: /<redacted path>: key:glusterfs-internal-fop flags: 1
> length:4 error:Operation not supported
> [2015-06-04 16:22:26.179874] E [posix.c:2325:posix_create] 0-bigdata2-posix:
> setting xattrs on /<path redacted> failed (Operation not supported)
>
> Note that both messages were referring to the same file. I have confirmed
> that xattr support is on in the underlying system. Additionally, these
> messages are NOT appearing on the other cluster members that seem to be
> unaffected by whatever is going on.
>
> I found this bug, which seems similar, but it was theoretically closed for
> the 3.6.1 release: https://bugzilla.redhat.com/show_bug.cgi?id=1098794
>
> Thanks again for your help.
>
> ________________________________
> From: Atin Mukherjee <[email protected]>
> Sent: Thursday, June 4, 2015 1:25 PM
> To: Branden Timm
> Cc: Shyamsundar Ranganathan; Susant Palai; [email protected];
> Atin Mukherjee; Nithya Balachandran
> Subject: Re: [Gluster-users] One host won't rebalance
>
> Sent from Samsung Galaxy S4
> On 4 Jun 2015 22:18, "Branden Timm" <[email protected]> wrote:
>>
>> Atin, thank you for the response.
>> Indeed, I have investigated the locks on that file, and it is a glusterfs
>> process with an exclusive read/write lock on the entire file:
>>
>> lsof /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
>> COMMAND     PID USER  FD  TYPE DEVICE SIZE/OFF     NODE NAME
>> glusterfs 12776 root  6uW  REG  253,1        6 15730814 /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
>>
>> That process was invoked with the following options:
>>
>> ps -ef | grep 12776
>> root 12776 1 0 Jun03 ? 00:00:03 /usr/sbin/glusterfs -s localhost
>>   --volfile-id rebalance/bigdata2
>>   --xlator-option *dht.use-readdirp=yes
>>   --xlator-option *dht.lookup-unhashed=yes
>>   --xlator-option *dht.assert-no-child-down=yes
>>   --xlator-option *replicate*.data-self-heal=off
>>   --xlator-option *replicate*.metadata-self-heal=off
>>   --xlator-option *replicate*.entry-self-heal=off
>>   --xlator-option *replicate*.readdir-failover=off
>>   --xlator-option *dht.readdir-optimize=on
>>   --xlator-option *dht.rebalance-cmd=1
>>   --xlator-option *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db
>>   --socket-file /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock
>>   --pid-file /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
>>   -l /var/log/glusterfs/bigdata2-rebalance.log
> This means there is already a rebalance process alive. Could you help me
> with the following:
> 1. What does bigdata2-rebalance.log say? Don't you see a shutting-down log
> message somewhere?
> 2. Does the output of gluster volume status show bigdata2 as rebalancing?
>
> As a workaround, can you kill this process and start a fresh rebalance
> process?
>>
>> Not sure if this information is helpful, but thanks for your reply.
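[Editor's note: a side remark on the setxattr "Operation not supported"
errors quoted earlier in this thread. One way to check whether the
filesystem backing a brick really accepts extended attributes is to set and
read back a throwaway key. The sketch below is not from the thread and not
Gluster code; it uses the unprivileged `user.*` namespace as a stand-in
(Gluster itself writes `trusted.*` keys, which require root), and the
directory argument is a placeholder to be replaced with the real brick path.]

```python
import errno
import os
import tempfile

def xattr_supported(directory):
    """Probe whether `directory`'s filesystem accepts extended attributes
    by setting and reading back a throwaway user.* key."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        # A filesystem without xattr support fails here with ENOTSUP,
        # which strerror() renders as "Operation not supported" --
        # the same text seen in the brick log.
        os.setxattr(path, b"user.xattr.probe", b"1")
        return os.getxattr(path, b"user.xattr.probe") == b"1"
    except OSError as e:
        if e.errno == errno.ENOTSUP:
            return False
        raise
    finally:
        os.close(fd)
        os.unlink(path)

# Replace "." with the brick directory on the affected host:
print(xattr_supported("."))
```

Since the reporter sees these errors on only one of three otherwise
identically configured hosts, running a probe like this on each brick
filesystem would show whether the difference is at the filesystem layer.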
>>
>> ________________________________________
>> From: Atin Mukherjee <[email protected]>
>> Sent: Thursday, June 4, 2015 9:24 AM
>> To: Branden Timm; [email protected]; Nithya Balachandran;
>> Susant Palai; Shyamsundar Ranganathan
>> Subject: Re: [Gluster-users] One host won't rebalance
>>
>> On 06/04/2015 06:30 PM, Branden Timm wrote:
>>> I'm really hoping somebody can at least point me in the right direction
>>> on how to diagnose this. This morning, roughly 24 hours after initiating
>>> the rebalance, one host of three in the cluster still hasn't done
>>> anything:
>>>
>>> Node        Rebalanced-files   size     scanned   failures   skipped   status        run time in secs
>>> ---------   ----------------   ------   -------   --------   -------   -----------   ----------------
>>> localhost   2543               14.2TB   11162     0          0         in progress   60946.00
>>> gluster-8   1358               6.7TB    9298      0          0         in progress   60946.00
>>> gluster-6   0                  0Bytes   0         0          0         in progress   0.00
>>>
>>> The only error showing up in the rebalance log is this:
>>>
>>> [2015-06-03 19:59:58.314100] E [MSGID: 100018]
>>> [glusterfsd.c:1677:glusterfs_pidfile_update] 0-glusterfsd: pidfile
>>> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
>>> lock failed [Resource temporarily unavailable]
>> This looks like acquiring the posix file lock failed, and it seems like
>> rebalance is *actually not* running. I would leave it to the DHT folks to
>> comment on it.
>>
>> ~Atin
>>>
>>> Any help would be greatly appreciated!
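[Editor's note: "Resource temporarily unavailable" is strerror(EAGAIN), the
errno a non-blocking POSIX advisory lock attempt returns when another
process already holds a conflicting lock. That is consistent with the `6uW`
exclusive write lock lsof showed on the pidfile, and with the workaround of
killing the stale process before restarting the rebalance. A minimal sketch
of the mechanism, not Gluster's actual code:]

```python
import errno
import fcntl
import os
import tempfile

# First "process" takes an exclusive, non-blocking lock on a pidfile-like
# file, roughly what glusterfs_pidfile_update does.
pidfile = tempfile.NamedTemporaryFile(delete=False)
fcntl.lockf(pidfile.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)

# POSIX record locks are held per-process, so a second locker must be a
# separate process: fork a child that opens its own descriptor and tries
# to take the same lock.
child = os.fork()
if child == 0:
    fd = os.open(pidfile.name, os.O_RDWR)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        os._exit(1)  # lock unexpectedly granted
    except OSError as e:
        # POSIX allows either EAGAIN or EACCES here; Linux reports EAGAIN,
        # i.e. "Resource temporarily unavailable".
        os._exit(0 if e.errno in (errno.EAGAIN, errno.EACCES) else 2)

_, status = os.waitpid(child, 0)
second_locker_got_eagain = (os.WEXITSTATUS(status) == 0)
print("second locker got EAGAIN:", second_locker_got_eagain)
os.unlink(pidfile.name)
```

So a pidfile lock failure with this errno means some process (here,
apparently the leftover glusterfs pid 12776) is still holding the lock,
matching Atin's reading that the new rebalance never actually started.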
>>>
>>> ________________________________
>>> From: [email protected] <[email protected]>
>>> on behalf of Branden Timm <[email protected]>
>>> Sent: Wednesday, June 3, 2015 11:52 AM
>>> To: [email protected]
>>> Subject: [Gluster-users] One host won't rebalance
>>>
>>> Greetings Gluster Users,
>>>
>>> I started a rebalance operation on my distributed volume today (CentOS
>>> 6.6/GlusterFS 3.6.3), and one of the three hosts comprising the cluster
>>> is just sitting at 0.00 for 'run time in secs', and shows 0 files
>>> scanned, failed, or skipped.
>>>
>>> I've reviewed the rebalance log for the affected server, and I'm seeing
>>> these messages:
>>>
>>> [2015-06-03 15:34:32.703692] I [MSGID: 100030] [glusterfsd.c:2018:main]
>>> 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.6.3
>>> (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/bigdata2
>>> --xlator-option *dht.use-readdirp=yes --xlator-option
>>> *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes
>>> --xlator-option *replicate*.data-self-heal=off --xlator-option
>>> *replicate*.metadata-self-heal=off --xlator-option
>>> *replicate*.entry-self-heal=off --xlator-option
>>> *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on
>>> --xlator-option *dht.rebalance-cmd=1 --xlator-option
>>> *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db --socket-file
>>> /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock
>>> --pid-file
>>> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
>>> -l /var/log/glusterfs/bigdata2-rebalance.log)
>>> [2015-06-03 15:34:32.704217] E [MSGID: 100018]
>>> [glusterfsd.c:1677:glusterfs_pidfile_update] 0-glusterfsd: pidfile
>>> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
>>> lock failed [Resource temporarily unavailable]
>>>
>>> I initially investigated the first warning, readv on 127.0.0.1:24007
>>> failed. netstat shows that ip/port belonging to a glusterd process.
>>> Beyond that I wasn't able to tell why there would be a problem.
>>>
>>> Next, I checked out what was up with the lock file that reported
>>> "resource temporarily unavailable". The file is present and contains the
>>> pid of a running glusterd process:
>>>
>>> root 12776 1 0 10:18 ? 00:00:00 /usr/sbin/glusterfs -s localhost
>>>   --volfile-id rebalance/bigdata2
>>>   --xlator-option *dht.use-readdirp=yes
>>>   --xlator-option *dht.lookup-unhashed=yes
>>>   --xlator-option *dht.assert-no-child-down=yes
>>>   --xlator-option *replicate*.data-self-heal=off
>>>   --xlator-option *replicate*.metadata-self-heal=off
>>>   --xlator-option *replicate*.entry-self-heal=off
>>>   --xlator-option *replicate*.readdir-failover=off
>>>   --xlator-option *dht.readdir-optimize=on
>>>   --xlator-option *dht.rebalance-cmd=1
>>>   --xlator-option *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db
>>>   --socket-file /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock
>>>   --pid-file /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
>>>   -l /var/log/glusterfs/bigdata2-rebalance.log
>>>
>>> Finally, one other thing I saw from running 'gluster volume status
>>> <volname> clients' is that the affected server is the only one of the
>>> three that lists a 127.0.0.1:<port> client for each of its bricks. I
>>> don't know why there would be a client coming from loopback on the
>>> server, but it seems strange. Additionally, it makes me wonder if the
>>> fact that I have auth.allow set to a single subnet (one that doesn't
>>> include 127.0.0.1) is causing this problem for some reason, or if
>>> loopback is implicitly allowed to connect.
>>>
>>> Any tips or suggestions would be much appreciated. Thanks!
>>>
>>> _______________________________________________
>>> Gluster-users mailing list
>>> [email protected]
>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>
>> --
>> ~Atin

--
~Atin
