On 11/9/2013 2:39 AM, Shawn Heisey wrote:
They are from the same log file - the one that I put on my dropbox
account and linked in the original message.  They are consecutive log
entries.

Further info from our developer who is looking deeper into these problems:



------------
Ouch. I know why the rebalance stopped. The host simply ran out of memory. From the messages file:

Nov  2 21:55:30 slc01dfs001a kernel: VFS: file-max limit 2438308 reached
Nov  2 21:55:31 slc01dfs001a kernel: automount invoked oom-killer: gfp_mask=0xd0, order=1, oom_adj=0, oom_score_adj=0
Nov  2 21:55:31 slc01dfs001a kernel: automount cpuset=/ mems_allowed=0
Nov  2 21:55:31 slc01dfs001a kernel: Pid: 2810, comm: automount Not tainted 2.6.32-358.2.1.el6.centos.plus.x86_64 #1

That "file max limit" line actually goes back to the beginning of Nov. 2, and happened on all four hosts. It is because of a file descriptor leak and was fixed in 3.3.2: https://bugzilla.redhat.com/show_bug.cgi?id=928631

This is unconnected to the file corruption/loss, which started much earlier. I'm still trying to understand that part. I noticed that three of the hosts reported successful rebalancing on the same day we started losing files. I am not sure how the rebalance work was distributed among the hosts, or whether the load on the other hosts was enough to keep things stable until they stopped.
------------



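In case anyone else wants to watch for that descriptor leak before it hits the file-max ceiling, a rough check along these lines should work on these CentOS 6 hosts. It only uses the standard /proc interfaces; the "glusterfsd" process name for the brick daemons is my assumption, so adjust it for your setup:

# System-wide: allocated fds, free fds, and the fs.file-max ceiling
cat /proc/sys/fs/file-nr

# Per-brick: count open descriptors held by each glusterfsd process
for pid in $(pgrep glusterfsd); do
    echo "$pid: $(ls /proc/$pid/fd | wc -l) open fds"
done
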
I gather that we should be at least on 3.3.2, but I suspect that a number of other bugs might be a problem unless we go to 3.4.1. The rebalance status output is below. Every host except "localhost" in this output was showing "completed" a very short time after I started the rebalance. The "localhost" line continued to increment until the rebalance died, four days after it started.

[root@slc01dfs001a ~]# gluster volume rebalance mdfs status
Node            Rebalanced-files    size      scanned     failures    status
---------       ----------------    ------    ---------   --------    ---------
localhost       1121514             1.5TB     9020514     1777661     failed
slc01nas1       0                   0Bytes    13638699    0           completed
slc01dfs002a    0                   0Bytes    13638699    1           completed
slc01dfs001b    0                   0Bytes    13638699    0           completed
slc01dfs002b    0                   0Bytes    13638700    0           completed
slc01nas2       0                   0Bytes    13638699    0           completed
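
Before retrying, something like the following should confirm what each node is actually running and pull the recorded failures out of the rebalance log on the node that died. The log path below is the default /var/log/glusterfs location for this volume, so it may differ on other setups:

# Confirm the installed GlusterFS version on each node
glusterfs --version

# Skim the last errors logged by the rebalance process for the "mdfs" volume
grep -iE 'error|failed' /var/log/glusterfs/mdfs-rebalance.log | tail -n 50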

Thanks,
Shawn
