On 06/05/2015 12:05 AM, Branden Timm wrote:
> I should add that there are additional errors as well in the brick logs. I've 
> posted them to a gist at 
> https://gist.github.com/brandentimm/576432ddabd70184d257
As I mentioned earlier, the DHT team can answer all your questions on this
failure.

~Atin
> 
> 
> ________________________________
> From: [email protected] <[email protected]> 
> on behalf of Branden Timm <[email protected]>
> Sent: Thursday, June 4, 2015 1:31 PM
> To: Atin Mukherjee
> Cc: [email protected]
> Subject: Re: [Gluster-users] One host won't rebalance
> 
> 
> I have stopped and restarted the rebalance several times, with no difference 
> in results. I have restarted all gluster services several times, and 
> completely rebooted the affected system.
> 
> 
> Yes, gluster volume status does show an active rebalance task for volume 
> bigdata2.
> 
> 
> I just noticed something else in the brick logs. I am seeing tons of messages 
> similar to these two:
> 
> 
> [2015-06-04 16:22:26.179797] E [posix-helpers.c:938:posix_handle_pair] 
> 0-bigdata2-posix: /<redacted path>: key:glusterfs-internal-fop flags: 1 
> length:4 error:Operation not supported
> [2015-06-04 16:22:26.179874] E [posix.c:2325:posix_create] 0-bigdata2-posix: 
> setting xattrs on /<path redacted> failed (Operation not supported)
> 
> 
> Note that both messages were referring to the same file. I have confirmed 
> that xattr support is on in the underlying system. Additionally, these 
> messages are NOT appearing on the other cluster members that seem to be 
> unaffected by whatever is going on.
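The ENOTSUP errors above can be probed independently of Gluster: try to set and read back an extended attribute directly on the brick filesystem. A minimal sketch (the probe key and the use of the current directory are placeholders, not anything Gluster itself uses):

```python
import errno
import os
import tempfile

def xattrs_supported(directory):
    """Set and read back a user.* xattr on a scratch file in `directory`;
    return False if the filesystem reports ENOTSUP."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        try:
            os.setxattr(path, b"user.gluster-probe", b"1")
            return os.getxattr(path, b"user.gluster-probe") == b"1"
        except OSError as e:
            if e.errno in (errno.ENOTSUP, errno.EOPNOTSUPP):
                return False
            raise
    finally:
        os.close(fd)
        os.unlink(path)

# Substitute the brick mount point for "." when probing a real brick:
print(xattrs_supported("."))
```

The failing key in the log (glusterfs-internal-fop) is Gluster-internal and set by the brick process as root; a user.* probe like this only confirms whether the filesystem accepts xattrs at all (e.g. user_xattr enabled on ext3/ext4).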
> 
> 
> I found this bug which seems to be similar, but it was theoretically closed 
> for the 3.6.1 release: https://bugzilla.redhat.com/show_bug.cgi?id=1098794
> 
> 
> Thanks again for your help.
> 
> 
> ________________________________
> From: Atin Mukherjee <[email protected]>
> Sent: Thursday, June 4, 2015 1:25 PM
> To: Branden Timm
> Cc: Shyamsundar Ranganathan; Susant Palai; [email protected]; Atin 
> Mukherjee; Nithya Balachandran
> Subject: Re: [Gluster-users] One host won't rebalance
> 
> 
> Sent from Samsung Galaxy S4
> On 4 Jun 2015 22:18, "Branden Timm" <[email protected]> wrote:
>>
>> Atin, thank you for the response.  Indeed I have investigated the locks on 
>> that file, and it is a glusterfs process with an exclusive read/write lock 
>> on the entire file:
>>
>> lsof 
>> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
>> COMMAND     PID USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
>> glusterfs 12776 root    6uW  REG  253,1        6 15730814 
>> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
>>
>> That process was invoked with the following options:
>>
>> ps -ef | grep 12776
>> root     12776     1  0 Jun03 ?        00:00:03 /usr/sbin/glusterfs -s 
>> localhost --volfile-id rebalance/bigdata2 --xlator-option 
>> *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes 
>> --xlator-option *dht.assert-no-child-down=yes --xlator-option 
>> *replicate*.data-self-heal=off --xlator-option 
>> *replicate*.metadata-self-heal=off --xlator-option 
>> *replicate*.entry-self-heal=off --xlator-option 
>> *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on 
>> --xlator-option *dht.rebalance-cmd=1 --xlator-option 
>> *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db --socket-file 
>> /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock 
>> --pid-file 
>> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
>>  -l /var/log/glusterfs/bigdata2-rebalance.log
> This means there is already a rebalance process alive. Could you help me with 
> the following:
> 1. What does bigdata2-rebalance.log say? Do you see a shutdown message 
> anywhere?
> 2. Does the output of gluster volume status show bigdata2 as rebalancing?
> 
> As a workaround, can you kill this process and start a fresh rebalance?
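Before killing anything, it may be worth confirming that the PID recorded in the pidfile still belongs to a live process. A small sketch (the helper names here are mine, not Gluster's):

```python
import os

def pidfile_pid(path):
    """Read the first integer from a pidfile such as
    /var/lib/glusterd/vols/<vol>/rebalance/<node-uuid>.pid."""
    with open(path) as f:
        return int(f.read().split()[0])

def pid_alive(pid):
    """Signal 0 checks existence/permission without actually
    delivering a signal to the target process."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False    # no such process: the pidfile is stale
    except PermissionError:
        return True     # process exists but belongs to another user
    return True
```

If pid_alive() returns True for the rebalance PID, it really is the leftover process holding the lock, and it is the one to kill before restarting.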
>>
>> Not sure if this information is helpful, but thanks for your reply.
>>
>> ________________________________________
>> From: Atin Mukherjee <[email protected]>
>> Sent: Thursday, June 4, 2015 9:24 AM
>> To: Branden Timm; [email protected]; Nithya 
>> Balachandran; Susant Palai; Shyamsundar Ranganathan
>> Subject: Re: [Gluster-users] One host won't rebalance
>>
>> On 06/04/2015 06:30 PM, Branden Timm wrote:
>>> I'm really hoping somebody can at least point me in the right direction on 
>>> how to diagnose this. This morning, roughly 24 hours after initiating the 
>>> rebalance, one host of three in the cluster still hasn't done anything:
>>>
>>>
>>>  Node       Rebalanced-files       size    scanned   failures   skipped        status   run time in secs
>>>  ---------  ----------------   --------   --------   --------   -------   -----------   ----------------
>>>  localhost              2543     14.2TB      11162          0         0   in progress           60946.00
>>>  gluster-8              1358      6.7TB       9298          0         0   in progress           60946.00
>>>  gluster-6                 0     0Bytes          0          0         0   in progress               0.00
>>>
>>>
>>> The only error showing up in the rebalance log is this:
>>>
>>>
>>> [2015-06-03 19:59:58.314100] E [MSGID: 100018] 
>>> [glusterfsd.c:1677:glusterfs_pidfile_update] 0-glusterfsd: pidfile 
>>> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
>>>  lock failed [Resource temporarily unavailable]
>> This looks like acquiring the POSIX file lock failed, which suggests that
>> rebalance is *actually not* running on this node. I would leave it to the
>> DHT folks to comment on it.
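For reference, "Resource temporarily unavailable" is strerror(EAGAIN), which is exactly what a non-blocking POSIX lock attempt returns when another process already holds the lock. A minimal reproduction of the mechanism (not Gluster code, just an illustration):

```python
import fcntl
import os
import subprocess
import sys
import tempfile

# The parent takes an exclusive, non-blocking POSIX lock on a scratch
# "pidfile", the same kind of lock glusterfsd takes on its real pidfile.
pidfile = tempfile.NamedTemporaryFile(delete=False)
fcntl.lockf(pidfile, fcntl.LOCK_EX | fcntl.LOCK_NB)

# A second process then attempts the same non-blocking lock; the kernel
# refuses, and strerror() renders the errno as the message seen in the
# rebalance log.
child_src = (
    "import errno, fcntl, sys\n"
    "f = open(sys.argv[1], 'r+')\n"
    "try:\n"
    "    fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)\n"
    "    print('locked')\n"
    "except OSError as e:\n"
    "    assert e.errno in (errno.EAGAIN, errno.EACCES)\n"
    "    print('Resource temporarily unavailable')\n"
)
result = subprocess.run(
    [sys.executable, "-c", child_src, pidfile.name],
    capture_output=True, text=True,
)
print(result.stdout.strip())  # -> Resource temporarily unavailable

pidfile.close()               # closing the fd also drops the lock
os.unlink(pidfile.name)
```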
>>
>> ~Atin
>>>
>>>
>>> Any help would be greatly appreciated!
>>>
>>>
>>>
>>> ________________________________
>>> From: [email protected] <[email protected]> 
>>> on behalf of Branden Timm <[email protected]>
>>> Sent: Wednesday, June 3, 2015 11:52 AM
>>> To: [email protected]<mailto:[email protected]>
>>> Subject: [Gluster-users] One host won't rebalance
>>>
>>>
>>> Greetings Gluster Users,
>>>
>>> I started a rebalance operation on my distributed volume today (CentOS 
>>> 6.6/GlusterFS 3.6.3), and one of the three hosts comprising the cluster is 
>>> just sitting at 0.00 for 'run time in secs', and shows 0 files scanned, 
>>> failed, or skipped.
>>>
>>>
>>> I've reviewed the rebalance log for the affected server, and I'm seeing 
>>> these messages:
>>>
>>>
>>> [2015-06-03 15:34:32.703692] I [MSGID: 100030] [glusterfsd.c:2018:main] 
>>> 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.6.3 
>>> (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/bigdata2 
>>> --xlator-option *dht.use-readdirp=yes --xlator-option 
>>> *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes 
>>> --xlator-option *replicate*.data-self-heal=off --xlator-option 
>>> *replicate*.metadata-self-heal=off --xlator-option 
>>> *replicate*.entry-self-heal=off --xlator-option 
>>> *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on 
>>> --xlator-option *dht.rebalance-cmd=1 --xlator-option 
>>> *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db --socket-file 
>>> /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock
>>>  --pid-file 
>>> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
>>>  -l /var/log/glusterfs/bigdata2-rebalance.log)
>>> [2015-06-03 15:34:32.704217] E [MSGID: 100018] 
>>> [glusterfsd.c:1677:glusterfs_pidfile_update] 0-glusterfsd: pidfile 
>>> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
>>>  lock failed [Resource temporarily unavailable]
>>>
>>>
>>> I initially investigated the first warning, "readv on 127.0.0.1:24007 
>>> failed". netstat shows that IP/port belongs to a glusterd process. Beyond 
>>> that I wasn't able to tell why there would be a problem.
>>>
>>>
>>> Next, I checked out the lock file that reported "Resource temporarily 
>>> unavailable". The file is present and contains the pid of a running 
>>> glusterfs process:
>>>
>>>
>>> root     12776     1  0 10:18 ?        00:00:00 /usr/sbin/glusterfs -s 
>>> localhost --volfile-id rebalance/bigdata2 --xlator-option 
>>> *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes 
>>> --xlator-option *dht.assert-no-child-down=yes --xlator-option 
>>> *replicate*.data-self-heal=off --xlator-option 
>>> *replicate*.metadata-self-heal=off --xlator-option 
>>> *replicate*.entry-self-heal=off --xlator-option 
>>> *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on 
>>> --xlator-option *dht.rebalance-cmd=1 --xlator-option 
>>> *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db --socket-file 
>>> /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock
>>>  --pid-file 
>>> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
>>>  -l /var/log/glusterfs/bigdata2-rebalance.log
>>>
>>>
>>> Finally, one other thing I saw from running 'gluster volume status 
>>> <volname> clients' is that the affected server is the only one of the three 
>>> that lists a 127.0.0.1:<port> client for each of its bricks. I don't know 
>>> why there would be a client coming from loopback on the server, but it 
>>> seems strange. Additionally, it makes me wonder whether the fact that I 
>>> have auth.allow set to a single subnet (one that doesn't include 127.0.0.1) 
>>> is causing this problem for some reason, or if loopback is implicitly 
>>> allowed to connect.
>>>
>>>
>>> Any tips or suggestions would be much appreciated. Thanks!
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Gluster-users mailing list
>>> [email protected]
>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>
>>
>> --
>> ~Atin
> 
> 
> 
> 

-- 
~Atin
