Brock Palen wrote:
> How common is it for servers to go NOT HEALTHY? I feel it is
> happening much more often than it should be with us. A few times a
> month.
>

It should not happen at all, in the normal case. It indicates a problem.

> If this happens, we reboot the servers. Should we do something
> else? Maybe it depends on what the problem was?

Well, determining the actual problem that caused the NOT HEALTHY
would be quite useful, yes. I would not just reboot.

- Examine the consoles of _all_ servers for any error indications.
- Examine the syslogs of _all_ servers for any LustreErrors or LBUGs.
- Check network and hardware health. Are your disks happy? Is your
  network dropping packets?

Try to figure out what was happening on the cluster. Does this relate
to a specific user workload or system load condition? Can you
reproduce the situation? Does it happen at a specific time of day or
time of month?

> If we should not be getting NOT HEALTHY that often, what information
> should I collect to report to CFS?

The lustre-diagnostics package is a good start for general system
config. Beyond that, most of what we would need is listed above.

cliffw

> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
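
As a rough illustration of the syslog check suggested in the reply above, here is a minimal Python sketch that picks out lines mentioning LustreError or LBUG and counts them per host. It is not part of the original thread. The function names are hypothetical, and the per-host count assumes traditional "Mon DD HH:MM:SS host ..." syslog lines, which may not match your logging setup.

```python
import re
from collections import Counter

# The two strings named in the reply: LustreError messages and LBUG assertions.
PATTERN = re.compile(r"LustreError|LBUG")


def find_lustre_errors(lines):
    """Return, in order, the lines that mention LustreError or LBUG."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


def count_by_host(matches):
    """Count matching lines per host.

    Assumption: each line looks like 'Mon DD HH:MM:SS host ...', so the
    hostname is the fourth whitespace-separated field. Lines that are too
    short to follow that layout are counted under 'unknown'.
    """
    counts = Counter()
    for line in matches:
        fields = line.split()
        host = fields[3] if len(fields) > 3 else "unknown"
        counts[host] += 1
    return counts
```

To use it, pass the lines of each server's syslog (an open file object works) to find_lustre_errors, then pass the result to count_by_host to see which servers are logging errors. A match only tells you where to start reading; the surrounding log lines and the console output still need a human look.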