Can you share the glusterd logs from all three nodes?

Rafi KC


On 04/28/2017 02:34 PM, Seva Gluschenko wrote:
> Dear Community,
>
>
> I call for your wisdom, as it appears that googling for keywords doesn't help 
> much.
>
> I have a glusterfs volume with replica count 2, and I tried to perform the 
> online upgrade procedure described in the docs 
> (http://gluster.readthedocs.io/en/latest/Upgrade-Guide/upgrade_to_3.10/). 
> Everything went almost fine with the first replica; the only problem was 
> the self-heal procedure, which refused to complete until I commented out 
> all IPv6 entries in /etc/hosts.
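[Inline note: the IPv6 workaround described above amounts to commenting out every IPv6 line in /etc/hosts. A throwaway sketch rehearsed on a mock hosts file — the addresses and hostnames below are made up for illustration, not taken from the poster's systems:

```shell
# Mock hosts file standing in for /etc/hosts.
hosts=$(mktemp)
cat > "$hosts" <<'EOF'
127.0.0.1    localhost
::1          localhost ip6-localhost
fe80::1      sst0
192.168.0.10 sst0
EOF

# Comment out every uncommented line whose address part contains a colon,
# i.e. the IPv6 entries; IPv4 lines are left untouched.
sed -i '/^[^#]*:/ s/^/#/' "$hosts"

grep -c '^#' "$hosts"   # prints: 2
```

On the real system one would of course edit /etc/hosts in place and keep a backup first.]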
>
> Confident that the 2nd replica would behave exactly the same as the 1st, 
> I proceeded with the upgrade on replica 2. All of a sudden, it reported 
> that it couldn't see the first replica at all. The state before the 
> upgrade was:
>
> sst2# gluster volume status
> Status of volume: gv0
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick sst0:/var/glusterfs                   49152     0          Y       3482 
> Brick sst2:/var/glusterfs                   49152     0          Y       29863
> NFS Server on localhost                     2049      0          Y       25175
> Self-heal Daemon on localhost               N/A       N/A        Y       25283
> NFS Server on sst0                          N/A       N/A        N       N/A  
> Self-heal Daemon on sst0                    N/A       N/A        Y       4827 
> NFS Server on sst1                          N/A       N/A        N       N/A  
> Self-heal Daemon on sst1                    N/A       N/A        Y       15009
>  
> Task Status of Volume gv0
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> sst2# gluster peer status
> Number of Peers: 2
>
> Hostname: sst0
> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
> State: Peer in Cluster (Connected)
>
> Hostname: sst1
> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
> State: Sent and Received peer request (Connected)
>
> sst2# gluster volume heal gv0 info
> Brick sst0:/var/glusterfs
> Number of entries: 0
>
> Brick sst2:/var/glusterfs
> Number of entries: 0
>
>
> After upgrade, it looked like this:
>
> sst2# gluster volume status
> Status of volume: gv0
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick sst2:/var/glusterfs                   N/A       N/A        N       N/A  
> NFS Server on localhost                     N/A       N/A        N       N/A  
> NFS Server on localhost                     N/A       N/A        N       N/A  
>  
> Task Status of Volume gv0
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> sst2# gluster peer status
> Number of Peers: 2
>
> Hostname: sst1
> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
> State: Sent and Received peer request (Connected)
>
> Hostname: sst0
> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
> State: Peer Rejected (Connected)
>
>
> Probably my biggest mistake: at that point I googled and found this article 
> https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Resolving%20Peer%20Rejected/
>  -- and followed its advice, removing on sst2 all the /var/lib/glusterd 
> contents except the glusterd.info file. As a result, the node predictably 
> lost all information about the volume.
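[Inline note: the "Resolving Peer Rejected" procedure referenced above boils down to stopping glusterd on the rejected node, wiping everything under /var/lib/glusterd except glusterd.info (which holds the node's UUID), starting glusterd again, and re-probing a healthy peer. A minimal sketch of the wipe step, rehearsed on a scratch directory rather than the live state directory:

```shell
# Rehearse on a scratch copy; on the real node the directory is
# /var/lib/glusterd and glusterd must be stopped before touching it.
demo=$(mktemp -d)
touch "$demo/glusterd.info" "$demo/options"
mkdir -p "$demo/vols/gv0" "$demo/peers"
touch "$demo/peers/26b35bd7-ad7e-4a25-a3f9-70002771e1fc"

# Delete everything except glusterd.info, so the node keeps its UUID.
find "$demo" -mindepth 1 ! -name glusterd.info -delete

ls "$demo"   # prints: glusterd.info
```

After restarting glusterd one would `gluster peer probe` a healthy node so the cleaned peer can pull the cluster configuration back; as the rest of the thread shows, this step alone did not recover the volume definition here.]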
>
> sst2# gluster volume status
> No volumes present
>
> sst2# gluster peer status
> Number of Peers: 2
>
> Hostname: sst0
> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
> State: Accepted peer request (Connected)
>
> Hostname: sst1
> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
> State: Accepted peer request (Connected)
>
> Okay, I thought, it might be high time to re-add the brick. Not that easy, 
> Jack:
>
> sst0# gluster volume add-brick gv0 replica 2 'sst2:/var/glusterfs'
> volume add-brick: failed: Operation failed
>
> The reason appeared natural: sst0 still knows that there was a replica on 
> sst2. What should I do then? At this point, I tried to recover the volume 
> information on sst2 by taking it offline and copying all the volume info 
> from sst0. Of course, it wasn't enough to copy it as-is; I modified 
> /var/lib/glusterd/vols/gv0/sst*\:-var-glusterfs, setting listen-port=0 for 
> the remote brick (sst0) and listen-port=49152 for the local brick (sst2). 
> Unfortunately, it didn't help much. The final state I've reached is as 
> follows:
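[Inline note: the hand-edit described above can be scripted. A hedged sketch against a mock brickinfo file — the file contents and the listen-port semantics (0 for a brick not served locally) are assumptions reconstructed from the poster's description, not verified against the 3.10 sources:

```shell
# Mock file standing in for /var/lib/glusterd/vols/gv0/sst0:-var-glusterfs.
brickinfo=$(mktemp)
cat > "$brickinfo" <<'EOF'
hostname=sst0
path=/var/glusterfs
listen-port=49152
EOF

# The remote brick's port is not served on this node, so set it to 0
# (assumption based on the behaviour described above).
sed -i 's/^listen-port=.*/listen-port=0/' "$brickinfo"

grep '^listen-port=' "$brickinfo"   # prints: listen-port=0
```

Whatever the exact port values, hand-copied state like this usually also needs a glusterd restart on the node before it is re-read.]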
>
> sst2# gluster peer status
> Number of Peers: 2
>
> Hostname: sst1
> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
> State: Sent and Received peer request (Connected)
>
> Hostname: sst0
> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
> State: Sent and Received peer request (Connected)
>
> sst2# gluster volume info
>  
> Volume Name: gv0
> Type: Replicate
> Volume ID: dd4996c0-04e6-4f9b-a04e-73279c4f112b
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
> Bricks:
> Brick1: sst0:/var/glusterfs
> Brick2: sst2:/var/glusterfs
> Options Reconfigured:
> cluster.self-heal-daemon: enable
> performance.readdir-ahead: on
> storage.owner-uid: 1000
> storage.owner-gid: 1000
>
> sst2# gluster volume status
> Status of volume: gv0
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick sst2:/var/glusterfs                   N/A       N/A        N       N/A  
> NFS Server on localhost                     N/A       N/A        N       N/A  
> NFS Server on localhost                     N/A       N/A        N       N/A  
>  
> Task Status of Volume gv0
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
>
> Meanwhile, on sst0:
>
> sst0# gluster volume info
>  
> Volume Name: gv0
> Type: Replicate
> Volume ID: dd4996c0-04e6-4f9b-a04e-73279c4f112b
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
> Bricks:
> Brick1: sst0:/var/glusterfs
> Brick2: sst2:/var/glusterfs
> Options Reconfigured:
> storage.owner-gid: 1000
> storage.owner-uid: 1000
> performance.readdir-ahead: on
> cluster.self-heal-daemon: enable
>
> sst0 ~ # gluster volume status
> Status of volume: gv0
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick sst0:/var/glusterfs                   49152     0          Y       31263
> NFS Server on localhost                     N/A       N/A        N       N/A  
> Self-heal Daemon on localhost               N/A       N/A        Y       31254
>  
> Task Status of Volume gv0
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
>
> Any ideas on how to bring sst2 back to normal are appreciated. As a last 
> resort, I can schedule downtime, back up the data, destroy the volume and 
> start all over, but I would like to know whether there is a shorter path. 
> Thank you very much in advance.
>
> -- 
> Best Regards,
>
> Seva Gluschenko
> _______________________________________________
> Gluster-users mailing list
> [email protected]
> http://lists.gluster.org/mailman/listinfo/gluster-users
