Kevin Van Maren wrote:
> It is not clear if you have a library path issue or not, as you
> trimmed too much from the strace. I would say not, as if you did you
> would get an exec error about not being able to find a shared library,
> not "nothing".
>
> Rather it sounds like the driver is not loading properly. ibstat
> should work even w/o a subnet manager running.
>
> This very much could have been caused by your loading the wrong
> firmware for that card. Given that the PSID was different, are you
> sure you flashed the right firmware for that card?
>
> Kevin

You're right:

mlx4_core: Mellanox ConnectX core driver v1.0 (April 4, 2008)
mlx4_core: Initializing 0000:03:00.0
mlx4_core 0000:03:00.0: command 0x13 failed: fw status = 0x1
mlx4_core 0000:03:00.0: SW2HW_EQ failed (-5)
mlx4_core 0000:03:00.0: Failed to initialize event queue table, aborting.
mlx4_core: probe of 0000:03:00.0 failed with error -5
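For anyone checking for the same symptom, the kernel log can be scanned for mlx4_core probe failures instead of reading it by eye. What follows is only a minimal Python sketch, assuming the log text has already been captured (for example from dmesg). The names SAMPLE_LOG, failed_probes and failure_lines are made up for illustration and are not part of any Mellanox or OFED tool; the sample lines are the ones quoted above.

```python
import re

# Sample lines copied from the kernel log quoted in this message.
SAMPLE_LOG = """\
mlx4_core: Mellanox ConnectX core driver v1.0 (April 4, 2008)
mlx4_core: Initializing 0000:03:00.0
mlx4_core 0000:03:00.0: command 0x13 failed: fw status = 0x1
mlx4_core 0000:03:00.0: SW2HW_EQ failed (-5)
mlx4_core 0000:03:00.0: Failed to initialize event queue table, aborting.
mlx4_core: probe of 0000:03:00.0 failed with error -5
"""

# Matches the final "probe of <pci address> failed with error <n>" line.
PROBE_FAILED = re.compile(
    r"mlx4_core: probe of (\S+) failed with error (-?\d+)"
)


def failed_probes(log_text):
    """Return {pci_address: error_code} for each mlx4_core probe failure."""
    return {
        m.group(1): int(m.group(2))
        for m in PROBE_FAILED.finditer(log_text)
    }


def failure_lines(log_text):
    """Return the mlx4_core lines that mention a failure, in log order."""
    return [
        line
        for line in log_text.splitlines()
        if line.startswith("mlx4_core") and re.search(r"fail", line, re.I)
    ]


if __name__ == "__main__":
    print(failed_probes(SAMPLE_LOG))
    for line in failure_lines(SAMPLE_LOG):
        print(line)
```

If failed_probes returns an entry for the card's PCI address, the driver never finished initializing the device, which would be consistent with ibstat printing nothing.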
Also, I can't see the card with mstflint:

[root@Lidia ~]# lspci | grep Mell
03:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0)
[root@Lidia ~]# mstflint -d 03:00.0 v
Warning: memory access to device 03:00.0 failed: Input/output error
Warning: Fallback on IO: much slower, and unsafe if device in use.
*** ERROR *** Can not open 03:00.0: Not a directory
MFE_CR_ERROR

However, when I boot Ubuntu (live CD), I can see the card with mstflint:
http://img29.imageshack.us/img29/9251/mstflint.png

I installed the firmware from here:
http://www.mellanox.com/content/pages.php?pg=firmware_table_Sun

I've got a SUN0070000001 (375-3549, X4217A-Z).

I also think that this firmware is odd. Do you know where I can get the right one?

Thanks,

Yann

>
>
> Rhys McMurdo wrote:
>> Hi Yann,
>>
>> Firstly, this probably isn't the best list to ask these questions.
>> There is a mailing list for the Linux HPC software stack available at
>> linux_hpc_swstack at lists.lustre.org
>>
>> Secondly, if I had to guess at your problem it looks like either you
>> may not have an OpenSM daemon running, or your library paths are not
>> right.
>>
>> Check the opensmd status via /etc/init.d/opensmd status. Also, what
>> does ldd /usr/sbin/ibstat show?
>>
>> Regards,
>>
>> Rhys
>>
>> 2009/9/21 Yann JOBIC <jobic at polytech.univ-mrs.fr>
>>
>> Hello,
>>
>> I've got 2 X4600, CentOS 5.3, the latest firmware for 375-3549 cards
>> from the Mellanox website (for Sun cards), and Sun HPC software
>> for Linux.
>>
>> When I run ibstat, in order to check the health of my
>> InfiniBand cards, I get nothing.
>> When I run the strace tool to see what happened, I get:
>>
>> [root@Lidia ~]# strace ibstat
>> execve("/usr/sbin/ibstat", ["ibstat"], [/* 34 vars */]) = 0
>> brk(0) = 0x1bb06000
>> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b47a0ae0000
>> uname({sys="Linux", node="Lidia", ...}) = 0
>> access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
>> open("/usr/mpi/gnu/ClusterTools-8.2/lib/64/tls/x86_64/libopensm.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
>> stat("/usr/mpi/gnu/ClusterTools-8.2/lib/64/tls/x86_64", 0x7fff09fc8b80) = -1 ENOENT (No such file or directory)
>> open("/usr/mpi/gnu/ClusterTools-8.2/lib/64/tls/libopensm.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
>> stat("/usr/mpi/gnu/ClusterTools-8.2/lib/64/tls", 0x7fff09fc8b80) = -1 ENOENT (No such file or directory)
>> open("/usr/mpi/gnu/ClusterTools-8.2/lib/64/x86_64/libopensm.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
>> stat("/usr/mpi/gnu/ClusterTools-8.2/lib/64/x86_64", 0x7fff09fc8b80) = -1 ENOENT (No such file or directory)
>> open("/usr/mpi/gnu/ClusterTools-8.2/lib/64/libopensm.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
>> stat("/usr/mpi/gnu/ClusterTools-8.2/lib/64", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
>> [....]
>> open("/usr/mpi/gnu/ClusterTools-8.2/lib/64/libosmvendor.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
>> [...]
>> open("/usr/mpi/gnu/ClusterTools-8.2/lib/64/libosmcomp.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
>> [...]
>> open("/usr/mpi/gnu/ClusterTools-8.2/lib/64/libibmad.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)
>> [...]
>> open("/usr/mpi/gnu/ClusterTools-8.2/lib/64/libibumad.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)
>> [...]
>>
>> It's missing some other files.
>>
>> When I flashed the firmware, I got this warning:
>>
>> [root@Lilou ~]# mstflint -d 03:00.0 -i fw-25408-2_6_000-375-3549-01.bin b
>>
>> Current FW version on flash: 2.5.100
>> New FW version: 2.6.0
>>
>> You are about to replace current PSID on flash - "SUN0070000001"
>> with a different PSID - "SUN0070130001".
>> Note: It is highly recommended not to change the PSID.
>>
>> Do you want to continue ? (y/n) [n] : y
>>
>> Burning second FW image without signatures - OK
>> Restoring second signature - OK
>>
>> I followed the deployment documentation. Did I miss something?
>> Has anybody had this kind of problem?
>>
>> Thanks,
>>
>> Yann
>>
>> -- ___________________________
>>
>> Yann JOBIC
>> HPC engineer
>> Polytech Marseille DME
>> IUSTI-CNRS UMR 6595
>> Technopôle de Château Gombert
>> 5 rue Enrico Fermi
>> 13453 Marseille cedex 13
>> Tel : (33) 4 91 10 69 39
>> ou (33) 4 91 10 69 43
>> Fax : (33) 4 91 10 69 69
>> _______________________________________________
>> hpcdev-discuss mailing list
>> hpcdev-discuss at opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/hpcdev-discuss
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Linux_hpc_swstack mailing list
>> Linux_hpc_swstack at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/linux_hpc_swstack
>>
>

-- ___________________________

Yann JOBIC
HPC engineer
Polytech Marseille DME
IUSTI-CNRS UMR 6595
Technopôle de Château Gombert
5 rue Enrico Fermi
13453 Marseille cedex 13
Tel : (33) 4 91 10 69 39
ou (33) 4 91 10 69 43
Fax : (33) 4 91 10 69 69
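
The PSID warning in the quoted flash transcript is the point where the mismatch could have been caught. As a rough illustration only, here is a Python sketch that pulls the two PSIDs out of that prompt text so a wrapper script could refuse to continue when they differ. It assumes the prompt wording is exactly as quoted in this thread; PROMPT, PSID_CHANGE and psid_change are hypothetical names, and nothing here calls mstflint itself.

```python
import re

# Prompt text copied from the flash transcript quoted in this thread.
PROMPT = (
    'You are about to replace current PSID on flash - "SUN0070000001"\n'
    'with a different PSID - "SUN0070130001".\n'
    "Note: It is highly recommended not to change the PSID.\n"
)

# Assumes the wording shown above; other mstflint versions may differ.
PSID_CHANGE = re.compile(
    r'current PSID on flash - "([^"]+)"\s+with a different PSID - "([^"]+)"'
)


def psid_change(text):
    """Return (current_psid, new_psid) if the warning is present, else None."""
    match = PSID_CHANGE.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    change = psid_change(PROMPT)
    if change is not None:
        current, new = change
        print("PSID would change from", current, "to", new)
        print("Stop here unless the new image is known to match the card.")
```

A wrapper built on this would answer "n" to the prompt whenever psid_change returns a pair, leaving the decision to a person who has checked the card's part number against the firmware table.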