Hi, everyone. We are using GlusterFS configured in the following way:
[root@web1 ~]# gluster volume info

Volume Name: share
Type: Replicate
Volume ID: hidden data on purpose
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: c10839:/gluster
Brick2: c10840:/gluster
Brick3: web3:/gluster
Options Reconfigured:
cluster.consistent-metadata: on
performance.readdir-ahead: on
nfs.disable: true
cluster.self-heal-daemon: on
cluster.metadata-self-heal: on
auth.allow: hidden data on purpose
performance.cache-size: 256MB
performance.io-thread-count: 8
performance.cache-refresh-timeout: 3

Here is the output of the status command for the volume and the peers:

[root@web1 ~]# gluster volume status
Status of volume: share
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick c10839:/gluster                       49152     0          Y       540
Brick c10840:/gluster                       49152     0          Y       533
Brick web3:/gluster                         49152     0          Y       782
Self-heal Daemon on localhost               N/A       N/A        Y       602
Self-heal Daemon on web3                    N/A       N/A        Y       790
Self-heal Daemon on web4                    N/A       N/A        Y       636
Self-heal Daemon on web2                    N/A       N/A        Y       523

Task Status of Volume share
------------------------------------------------------------------------------
There are no active volume tasks

[root@web1 ~]# gluster peer status
Number of Peers: 3

Hostname: web3
Uuid: b138b4d5-8623-4224-825e-1dfdc3770743
State: Peer in Cluster (Connected)

Hostname: web2
Uuid: b3926959-3ae8-4826-933a-4bf3b3bd55aa
State: Peer in Cluster (Connected)
Other names: c10840.sgvps.net

Hostname: web4
Uuid: f7553cba-c105-4d2c-8b89-e5e78a269847
State: Peer in Cluster (Connected)

All in all, we have three servers that host bricks and actually store the data, and one server that is just a peer in the cluster without a brick of its own.

*The Problem*: If any of the four servers goes down, the cluster continues to work as expected. However, once that server comes back up, the whole cluster stalls for a period of time (30-120 seconds).
During this period no I/O operations can be executed, and the applications that use the data on the GlusterFS volume simply go down because they cannot read or write any data. We suspect that the issue is related to the self-heal daemons, but we are not sure.

Could you please advise how to debug this issue and what could be causing the whole cluster to stall? If it is the self-heal, as we suspect, do you think it is OK to disable it? If some of the settings are causing this problem, could you please advise how to configure the cluster to avoid it? If any info from the logs is needed, please let us know what you need.

Thanks in advance!

Regards,
Daniel
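For reference, this is roughly what we plan to run the next time the stall happens, and the throttling settings we are considering instead of disabling self-heal outright. The option names and log/statedump paths are taken from the Gluster documentation for our version, so please correct us if any of them are wrong for this setup:

```shell
# While the cluster is stalled, see which entries the self-heal
# daemon thinks need healing (run on any brick server):
gluster volume heal share info

# Dump the internal state of the brick processes for later analysis;
# statedumps typically land under /var/run/gluster/:
gluster volume statedump share

# Watch the self-heal daemon log for heal/lock activity:
tail -f /var/log/glusterfs/glustershd.log

# Possible alternative to disabling self-heal: throttle it.
# The values below are examples we are considering, not recommendations:
gluster volume set share cluster.background-self-heal-count 4
gluster volume set share cluster.self-heal-window-size 1
gluster volume set share cluster.data-self-heal-algorithm diff
```

Would throttling along these lines be a reasonable middle ground, or is there a better-supported way to keep heals from blocking client I/O?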
_______________________________________________
Gluster-users mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-users
