Hi,

Digging through the archives of this list and Bugzilla, it seems the problem
I'm about to describe has existed for a long time. However, it's unclear to me
whether a solution was ever found, so I'd like to get some input from the
users mailing list.

For a volume with a very large number of files (several million), the
self-heal system kicks in after a node outage or when we replace a brick and
present it empty to the cluster, which is the expected behaviour.

However, during this self-heal the system load climbs so high that the machine
is effectively unavailable for several hours until the heal completes. In
extreme cases it even prevents SSH logins, and at one point we had to force a
reboot to recover a minimum of usability.

Has anyone found a way to keep the load of the self-heal system at a more
acceptable level? My understanding is that the load comes from the very large
number of IOPS each brick needs in order to enumerate all the files and read
their metadata flags, and then to copy the data and write the changes. The
machines themselves are quite capable of heavy IO: the disks are all SSDs in
RAID-0, and multiple network links are bonded on each machine for extra
bandwidth.
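
For example, I came across tunables like the ones below, which, if I read the
documentation correctly, are meant to throttle healing. The volume name
"myvol" is just a placeholder, and I'm not certain these options behave the
same way in every version:

    # Heal fewer files concurrently in the background (default is 16, I believe)
    gluster volume set myvol cluster.background-self-heal-count 4

    # Transfer only the changed blocks of a file instead of whole files
    gluster volume set myvol cluster.data-self-heal-algorithm diff

I'm not sure whether these affect the self-heal daemon or only the heals
triggered by client access, though.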

I don't mind how long the heal takes; I mind the impact healing has on other
operations.
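
As a stopgap, I have also been wondering about simply deprioritizing the
self-heal daemon at the OS level, along these lines (untested on our side,
and assuming the daemon shows up as a separate glustershd process):

    # Lower the CPU and IO priority of the self-heal daemon
    # (ionice's idle class only takes effect with the CFQ scheduler)
    for pid in $(pgrep -f glustershd); do
        renice -n 19 -p "$pid"
        ionice -c 3 -p "$pid"
    done

Though I suspect much of the IO actually happens inside the brick processes
themselves, so this may not buy much.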

Any ideas?

Thanks

Laurent Chouinard