I had opened another thread on this mailing list (Subject: "After upgrade from 3.4.2 to 3.8.5 - High CPU usage resulting in disconnects and split-brain").
The title may be a bit misleading now, as I am no longer observing high CPU usage after upgrading to 3.8.6, but the disconnects are still happening and the number of files in split-brain is growing.

Setup: 6 compute nodes, each serving as a glusterfs server and client, Ubuntu 14.04, two bricks per node, distribute-replicate.

I have two gluster volumes set up (one for scratch data, one for the slurm scheduler). Only the scratch data volume shows critical errors "[...] has not responded in the last 42 seconds, disconnecting.". Since both volumes share the same link, I can rule out network problems: the gigabit link between the nodes is not saturated at all, and the disks are almost idle (<10%).

I have glusterfs 3.4.2 on Ubuntu 12.04 on another compute cluster, running fine since it was deployed. I had glusterfs 3.4.2 on Ubuntu 14.04 on this cluster, running fine for almost a year. After upgrading to 3.8.5, the problems described above started. I would like to use some of the new features of the newer versions (like bitrot), but the users can't run their compute jobs right now because the result files are garbled.

There also seems to be a bug report with a similar problem, but no progress:
https://bugzilla.redhat.com/show_bug.cgi?id=1370683

For me, ALL servers are affected (not isolated to one or two servers).

I also see messages like "INFO: task gpu_graphene_bv:4476 blocked for more than 120 seconds." in the syslog.

For completeness (gv0 is the scratch volume, gv2 the slurm volume):

[root@giant2: ~]# gluster v info

Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 993ec7c9-e4bc-44d0-b7c4-2d977e622e86
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: giant1:/gluster/sdc/gv0
Brick2: giant2:/gluster/sdc/gv0
Brick3: giant3:/gluster/sdc/gv0
Brick4: giant4:/gluster/sdc/gv0
Brick5: giant5:/gluster/sdc/gv0
Brick6: giant6:/gluster/sdc/gv0
Brick7: giant1:/gluster/sdd/gv0
Brick8: giant2:/gluster/sdd/gv0
Brick9: giant3:/gluster/sdd/gv0
Brick10: giant4:/gluster/sdd/gv0
Brick11: giant5:/gluster/sdd/gv0
Brick12: giant6:/gluster/sdd/gv0
Options Reconfigured:
auth.allow: X.X.X.*,127.0.0.1
nfs.disable: on

Volume Name: gv2
Type: Replicate
Volume ID: 30c78928-5f2c-4671-becc-8deaee1a7a8d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: giant1:/gluster/sdd/gv2
Brick2: giant2:/gluster/sdd/gv2
Options Reconfigured:
auth.allow: X.X.X.*,127.0.0.1
cluster.granular-entry-heal: on
cluster.locking-scheme: granular
nfs.disable: on
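In case it helps with debugging, this is how I am watching the problem (a sketch using the volumes above; network.ping-timeout is the option behind the "42 seconds" message, and the 120 seconds come from the kernel's hung task watchdog):

# Enumerate the files currently in split-brain on the scratch volume:
gluster volume heal gv0 info split-brain

# The "42 seconds" matches the default of network.ping-timeout;
# show the value currently in effect for gv0:
gluster volume get gv0 network.ping-timeout

# The syslog messages use the kernel hung task timeout (default 120 s):
cat /proc/sys/kernel/hung_task_timeout_secs
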
2016-11-29 18:53 GMT+01:00 Atin Mukherjee <[email protected]>:

> Would you be able to share what is not working for you in 3.8.x (mention
> the exact version)? 3.4 is quite old, and falling back to an unsupported
> version doesn't look like a feasible option.
>
> On Tue, 29 Nov 2016 at 17:01, Micha Ober <[email protected]> wrote:
>
>> Hi,
>>
>> I was using gluster 3.4 and upgraded to 3.8, but that version turned out
>> to be unusable for me. I now need to downgrade.
>>
>> I'm running Ubuntu 14.04. As upgrades of the op version are
>> irreversible, I guess I have to delete all gluster volumes and re-create
>> them with the downgraded version:
>>
>> 0. Back up data
>> 1. Unmount all gluster volumes
>> 2. apt-get purge glusterfs-server glusterfs-client
>> 3. Remove the PPA for 3.8
>> 4. Add the PPA for the older version
>> 5. apt-get install glusterfs-server glusterfs-client
>> 6. Create volumes
>>
>> Is "purge" enough to delete all configuration files of the currently
>> installed version, or do I need to manually clear some residues before
>> installing an older version?
>>
>> Thanks.
>
> --
> - Atin (atinm)
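Regarding the "purge" question quoted above, my understanding (a sketch, not verified on every Ubuntu release) is that apt-get purge removes the package conffiles but not the glusterd working directory or anything stored on the bricks, so I would clear these residues by hand before re-creating the volumes (brick paths as in the volume info above; repeat the xattr/.glusterfs steps for every brick):

# glusterd state (volume definitions, peers, op-version) lives here;
# the current op-version can be checked before wiping it:
grep operating-version /var/lib/glusterd/glusterd.info

# Working directory and logs are typically left behind by a purge:
rm -rf /var/lib/glusterd /var/log/glusterfs

# Brick roots keep the volume-id and gfid xattrs plus the .glusterfs
# tree; "gluster volume create" refuses a path that still carries them:
setfattr -x trusted.glusterfs.volume-id /gluster/sdc/gv0
setfattr -x trusted.gfid /gluster/sdc/gv0
rm -rf /gluster/sdc/gv0/.glusterfs
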
_______________________________________________
Gluster-users mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-users
