Hi everyone,

We've been using 3.3.2 for a while and recently started migrating to 3.4.2. 
The 3.4.2 nodes run on CentOS 6.5 (3.3.2 was installed on CentOS 6.4).

Recently, we had a very scary condition happen, and we do not know exactly 
what caused it.

We have a 3-node cluster with a replication factor of 3. Each node has one 
brick, which is backed by a single RAID0 volume made up of multiple SSDs.

Following some read/write errors, nodes 2 and 3 locked up completely. Nothing 
could be done locally or remotely (nothing on the screen, no SSH access), so a 
physical power cycle was required. Node 1 was still accessible, but its FUSE 
client rejected most, if not all, reads and writes.

Has anyone experienced something similar?

Before the system froze, the last thing the kernel seemed to be doing was 
flagging httpd threads as hung (INFO: task httpd:7910 blocked for more than 
120 seconds.)  End users talk to Apache in order to read from and write to the 
Gluster volume, so it looks like a simple case of "something wrong" with 
Gluster that blocks reads and writes until the kernel flags the stuck threads.
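
For anyone who wants to check their own nodes for the same symptom, here is a 
minimal sketch of how those hung-task lines could be pulled out of a kernel 
log. The function name and the log path mentioned in the comments are 
assumptions for illustration only, not something taken from our setup.

```python
import re

# Matches kernel hung-task warnings of the form quoted above, e.g.
#   INFO: task httpd:7910 blocked for more than 120 seconds.
HUNG_TASK = re.compile(
    r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds"
)


def find_hung_tasks(lines):
    """Return a (task name, pid, seconds) tuple for each hung-task warning."""
    hits = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            hits.append(
                (match.group(1), int(match.group(2)), int(match.group(3)))
            )
    return hits


# Hypothetical usage; /var/log/messages is an assumed log location:
#   with open("/var/log/messages", errors="replace") as log:
#       for name, pid, seconds in find_hung_tasks(log):
#           print(name, pid, seconds)
```

Grouping the results by task name and timestamp might show whether the httpd 
threads were the first to block or whether something else hung before them.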

At this point, we're unsure where to look. Nothing very specific can be found 
in the logs, but if someone has pointers on what to look for, that could give 
us a new direction to search in.

Thanks

Laurent Chouinard
_______________________________________________
Gluster-users mailing list
[email protected]
http://supercolony.gluster.org/mailman/listinfo/gluster-users
