On 11/9/2013 2:39 AM, Shawn Heisey wrote:
They are from the same log file - the one that I put on my dropbox
account and linked in the original message.  They are consecutive log
entries.

Further info from our developer who is looking deeper into these problems:



------------
Ouch. I know why the rebalance stopped. The host simply ran out of memory. From the messages file:

Nov  2 21:55:30 slc01dfs001a kernel: VFS: file-max limit 2438308 reached
Nov  2 21:55:31 slc01dfs001a kernel: automount invoked oom-killer: gfp_mask=0xd0, order=1, oom_adj=0, oom_score_adj=0
Nov  2 21:55:31 slc01dfs001a kernel: automount cpuset=/ mems_allowed=0
Nov  2 21:55:31 slc01dfs001a kernel: Pid: 2810, comm: automount Not tainted 2.6.32-358.2.1.el6.centos.plus.x86_64 #1

That "file max limit" line actually goes back to the beginning of Nov. 2, and happened on all four hosts. It is because of a file descriptor leak and was fixed in 3.3.2: https://bugzilla.redhat.com/show_bug.cgi?id=928631

This is unconnected to the file corruption/loss, which started much earlier. I'm still trying to understand that part. I noticed that three of the hosts reported successful rebalancing on the same day we started losing files. I am not sure how the rebalance work was distributed among the hosts, or whether the load on the other hosts was enough to keep things stable until they stopped.
------------



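In case anyone else wants to watch for that descriptor leak before it hits the file-max ceiling, a rough check along these lines should work on these CentOS 6 hosts. It only uses the standard /proc interfaces; the "glusterfsd" process name for the brick daemons is my assumption, so adjust it for your setup:

# System-wide: allocated fds, free fds, and the fs.file-max ceiling
cat /proc/sys/fs/file-nr

# Per-brick: count open descriptors held by each glusterfsd process
for pid in $(pgrep glusterfsd); do
    echo "$pid: $(ls /proc/$pid/fd | wc -l) open fds"
done
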
I gather that we should be at least on 3.3.2, but I suspect that a number of other bugs might be a problem unless we go to 3.4.1. The rebalance status output is below. Every host except "localhost" in this output was showing "completed" a very short time after I started the rebalance. The "localhost" line continued to increment until the rebalance died, four days after it started.

[root@slc01dfs001a ~]# gluster volume rebalance mdfs status
Node            Rebalanced-files    size      scanned     failures    status
---------       ----------------    ------    ---------   --------    ---------
localhost       1121514             1.5TB     9020514     1777661     failed
slc01nas1       0                   0Bytes    13638699    0           completed
slc01dfs002a    0                   0Bytes    13638699    1           completed
slc01dfs001b    0                   0Bytes    13638699    0           completed
slc01dfs002b    0                   0Bytes    13638700    0           completed
slc01nas2       0                   0Bytes    13638699    0           completed
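
Before retrying, something like the following should confirm what each node is actually running and pull the recorded failures out of the rebalance log on the node that died. The log path below is the default /var/log/glusterfs location for this volume, so it may differ on other setups:

# Confirm the installed GlusterFS version on each node
glusterfs --version

# Skim the last errors logged by the rebalance process for the "mdfs" volume
grep -iE 'error|failed' /var/log/glusterfs/mdfs-rebalance.log | tail -n 50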

Thanks,
Shawn
