On Sun, Jul 20, 2008 at 08:40:19AM -0400, Mag Gam wrote:
>I am trying to understand. What was the problem? How does SD_IOSTATS
>affect the crash? How did you disable this?
the comments describe the bug:
  https://bugzilla.lustre.org/show_bug.cgi?id=16404#c22
from a quick look it seems to be an SMP locking issue around the
statistics collection that presumably, under some circumstances, can
cause an overflow and a crash.

the way to disable it is to rebuild the patched-by-Lustre RHEL kernel
with the CONFIG_SD_IOSTATS option turned off.

>Sorry for a newbie question....

no probs. let me know if you need a recipe for patching and rebuilding
this kernel. I should really write it all down before I forget anyway...
there are most likely descriptions for patching and building kernels on
the Lustre wiki too.

cheers,
robin

>
>
>On Sun, Jul 20, 2008 at 4:54 AM, Robin Humble
><[EMAIL PROTECTED]> wrote:
>> On Fri, Jul 18, 2008 at 09:02:36AM -0400, Brian J. Murrell wrote:
>>>On Fri, 2008-07-18 at 05:52 -0400, Robin Humble wrote:
>>>> Hi,
>>>>
>>>> I'm seeing coordinated OSS crashes with Lustre 1.6.5.1.
>>>>
>>>> our RHEL4 OSSes have been stable for ~months with these kernels:
>>>>   kernel-lustre-smp-2.6.9-67.0.4.EL_lustre.1.6.4.3
>>>>   kernel-lustre-smp-2.6.9-55.0.9.EL_lustre.1.6.4.2
>>>>
>>>> but have crashed hard, twice, about 10hrs apart as soon as we started
>>>> using this kernel:
>>>>   kernel-lustre-smp-2.6.9-67.0.7.EL_lustre.1.6.5.1
>>>
>>>Can you try rebuilding the kernel, disabling SD_IOSTATS?
>>
>> done. I rebuilt using the stock kernel's InfiniBand stack and
>>   # CONFIG_SD_IOSTATS is not set
>>
>> % cexec -p oss: uptime
>> oss x17: 18:45:07 up 1 day, 30 min, 1 user, load average: 4.97, 7.00, 6.27
>> oss x18: 18:45:07 up 1 day, 23 min, 1 user, load average: 4.18, 5.78, 5.71
>> oss x19: 18:45:07 up 1 day, 23 min, 1 user, load average: 5.18, 5.66, 4.60
>>
>> which is well beyond the 10hrs it was crashing at before.
>> good guess about the cause of the problem! :-)
>>
>> maybe that rhel4 1.6.5.1 kernel rpm needs a respin then? seems like a
>> fairly critical issue...
>> :-/
>>
>> cheers,
>> robin
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
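
For anyone wanting to try the workaround before a recipe is written up,
a rough sketch of disabling CONFIG_SD_IOSTATS and rebuilding (the source
path below is hypothetical, and spec/package names vary with the exact
Lustre-patched kernel tree; this follows the generic kernel rebuild flow,
not a tested procedure):

```shell
# hypothetical location of the Lustre-patched kernel source tree --
# adjust to wherever you unpacked/prepped the kernel-lustre source rpm
cd /usr/src/linux-2.6.9-67.0.7.EL_lustre.1.6.5.1

# turn the option off in .config: change "CONFIG_SD_IOSTATS=y" into the
# standard disabled form "# CONFIG_SD_IOSTATS is not set"
sed -i 's/^CONFIG_SD_IOSTATS=y$/# CONFIG_SD_IOSTATS is not set/' .config

# let kbuild reconcile the config, then rebuild kernel and modules
make oldconfig
make -j4 bzImage modules

# install (or, better, repackage as an rpm so your nodes stay consistent)
make modules_install install
```

After rebooting into the rebuilt kernel, `grep SD_IOSTATS /boot/config-$(uname -r)`
should show the "is not set" line, matching what Robin saw above.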