Hi Brian,

we got the following messages when starting IB:

Jul 31 15:22:55 doss1 kernel: ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008)
Jul 31 15:22:55 doss1 kernel: ib_mthca: Initializing 0000:20:00.0
Jul 31 15:22:55 doss1 kernel: GSI 24 sharing vector 0x92 and IRQ 24
Jul 31 15:22:55 doss1 kernel: ACPI: PCI Interrupt 0000:20:00.0[A] -> GSI 24 (level, low) -> IRQ 146
Jul 31 15:22:56 doss1 kernel: ib_mthca 0000:20:00.0: HCA FW version 3.1.000 is old (3.5.000 is current).
Jul 31 15:22:56 doss1 kernel: ib_mthca 0000:20:00.0: If you have problems, try updating your HCA FW.
Jul 31 15:22:56 doss1 kernel: ib_mthca 0000:20:00.0: NOP command failed to generate interrupt (IRQ 170).
Jul 31 15:22:56 doss1 kernel: ib_mthca 0000:20:00.0: Trying again with MSI/MSI-X disabled.
Jul 31 15:23:56 doss1 kernel: ib_mthca 0000:20:00.0: HW2SW_EQ failed (-11)
Jul 31 15:23:56 doss1 kernel: ib_mthca 0000:20:00.0: HW2SW_EQ returned status 0xff
Jul 31 15:23:56 doss1 kernel: ib_mthca 0000:20:00.0: HW2SW_MPT failed (-11)
Jul 31 15:23:56 doss1 kernel: ib_mthca 0000:20:00.0: HW2SW_EQ failed (-11)
Jul 31 15:23:56 doss1 kernel: ib_mthca 0000:20:00.0: HW2SW_EQ returned status 0xff
Jul 31 15:23:56 doss1 kernel: ib_mthca 0000:20:00.0: HW2SW_MPT failed (-11)
Jul 31 15:23:56 doss1 kernel: ib_mthca 0000:20:00.0: HW2SW_EQ failed (-11)
Jul 31 15:23:56 doss1 kernel: ib_mthca 0000:20:00.0: HW2SW_EQ returned status 0xff
Jul 31 15:23:56 doss1 kernel: ib_mthca 0000:20:00.0: HW2SW_MPT failed (-11)

We updated the HCA firmware as the log suggests, and that resolved the
problem. IB is now working.

What about the second issue?
http://lists.lustre.org/pipermail/lustre-discuss/2008-June/007767.html

Is there any news?

Thank you and best regards,

Danny

Brian J. Murrell wrote:
> On Thu, 2008-07-31 at 16:08 +0200, Danny Sternkopf wrote:
>> Hi,
>>
>> installed all the new Lustre 1.6.5.1 packages on a CentOS5.1 system and
>> if I start OpenIB the server crashes. It also can't be rebooted anymore
>> until the kernel-ib RPM is deinstalled.
>
> That sounds very suspect.
>
>> Did anybody get it running?
>
> Most certainly our QA department had it all running before we released
> it.
>
> I suspect that you have some other problem masquerading itself as a
> problem with the OFED stack.
>
> I'm afraid there is not much we can do to help you without seeing some
> logs or error messages or the like. You might have to instrument your
> boot with some debugging to see where it's really getting stuck.
>
> b.
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
