Hello,

I hope this is the right mailing list. I have a problem with OFED 1.4.2 on CentOS 5.3.
We have a new cluster with QDR InfiniBand. I installed OFED from source using the install.pl script with the default values, on the default CentOS kernel (2.6.18-128.7.1.el5). When a node starts, the openibd and opensmd services are launched, and InfiniBand is working:

ibv_devinfo
hca_id: mlx4_0
        fw_ver:                 2.6.000
        node_guid:              0002:c903:0004:3efc
        sys_image_guid:         0002:c903:0004:3eff
        vendor_id:              0x02c9
        vendor_part_id:         26428
        hw_ver:                 0xA0
        board_id:               MT_0C40110009
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         9
                        port_lid:       17
                        port_lmc:       0x00

If I launch an MPI program, e.g. VASP, it uses the InfiniBand transport and the performance is good. So there is no problem until I launch a program that does not use InfiniBand: Gaussian. On some large calculations with %ncpu=8, Gaussian aborts with the following message:

ntrbks: Input/output error

I launched Gaussian several times and it always aborted at the same point. If I run the same Gaussian on the same input file on our old cluster (same CentOS 5.3, same kernel, but without InfiniBand), it works.

Searching the source code for ntrbks shows a call to fstatfs, so I straced Gaussian on the two clusters. Here is the relevant part.
New cluster:

[pid 5715] execve("/opt/Gaussian/g03_e01-pgf//g03/l1002.exe", ["/opt/Gaussian/g03_e01-pgf//g03/l"..., "1258291200", "CpRh_H_Ph_EneTS1.chk", "1", "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.i"..., "0", "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.r"..., "0", "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.d"..., "0", "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.s"..., "0", "/tmp/CpRh_H_Ph_EneTS1/Gau-5714.i"..., "0", "junk.out", "0", ...], [/* 65 vars */]
PANIC: attached pid 5816 exited with 0
[pid 5715] open("CpRh_H_Ph_EneTS1.chk", O_RDWR) = 5
[pid 5715] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768, f_blocks=13562292, f_bfree=12353558, f_bavail=12353558, f_files=434124416, f_ffree=434090957, f_fsid={0, 0}, f_namelen=255, f_frsize=32768}) = 0
[pid 5715] read(5, "\10\0\0\0\0\0\0\0", 8) = 8
[pid 5715] read(5, "\10\0\0\0\0\0\0\0\0\320\10\0\0\0\0\0\0\240\0\0\0\0\0\0\0\0\0\0\0\0 \0\0"..., 320032) = 320032
[pid 5715] fstatfs(5, 0x7fff553be6d0) = -1 EIO (Input/output error)

Old cluster:

[pid 8605] execve("/opt/Gaussian/g03_e01-pgf//g03/l1002.exe", ["/opt/Gaussian/g03_e01-pgf//g03/l"..., "1258291200", "CpRh_H_Ph_EneTS1.chk", "1", "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0", "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0", "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0", "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0", "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0", "junk.out", "0", ...], [/* 82 vars */]
PANIC: attached pid 8701 exited with 0
[pid 8605] open("CpRh_H_Ph_EneTS1.chk", O_RDWR) = 5
[pid 8605] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768, f_blocks=9150944, f_bfree=686850, f_bavail=686850, f_files=88123232, f_ffree=87917195, f_fsid={0, 0}, f_namelen=255, f_frsize=32768}) = 0
[pid 8605] read(5, "'\0\0\0\0\0\0\0", 8) = 8
[pid 8605] read(5, "'\0\0\0\0\0\0\0\0\360$\0\0\0\0\0\0\240\0\0\0\0\0\0\0\0\0\0\0\0\0 \0"..., 320032) = 320032
[pid 8605] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768, f_blocks=9150944, f_bfree=683022, f_bavail=683022, f_files=87633232, f_ffree=87427156, f_fsid={0, 0}, f_namelen=255, f_frsize=32768}) = 0
[pid 8605] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768, f_blocks=9150944, f_bfree=683022, f_bavail=683022, f_files=87633232, f_ffree=87427156, f_fsid={0, 0}, f_namelen=255, f_frsize=32768}) = 0
[pid 8605] write(5, "'\0\0\0\0\0\0\0\0\360$\0\0\0\0\0\0\240\0\0\0\0\0\0\0\0\0\0\0\0\0 \0"..., 320032) = 320032
[pid 8605] close(5) = 0

On the new cluster, the second fstatfs on the same file descriptor fails with EIO right after the two reads, whereas on the old cluster it succeeds.

If I put the input file in a local directory instead of an NFS one, Gaussian works. There are no messages in dmesg or in the /var/log directory, either on the node or on the NFS server.

On the node, /home is mounted as:

192.168.1.100:/home on /home type nfs (rw,nosuid,rsize=32768,proto=tcp,addr=192.168.1.100)

192.168.1.xxx is the Ethernet network (the NFS server has no InfiniBand card).

On the node, it is enough to run

ifconfig ib0 down
service opensmd stop

to get Gaussian working on the NFS directory. Either «ifconfig ib0 down» or «service opensmd stop» alone is not enough.

So there seems to be an interaction between NFS access and OpenFabrics. But why? And how can it be solved?

Thanks in advance,
Fabrice BOYRIE

_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
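
One possible way to narrow this down without Gaussian is a small probe that repeats the pattern seen in the trace: open a file read-write, read 8 bytes, read a 320032-byte block, then query the filesystem on the same descriptor. The sketch below is only an illustration, not something taken from Gaussian. The function name probe_fstatvfs and its parameters are hypothetical, it assumes Python 3 is available on the node, and it relies on os.fstatvfs, which on Linux with glibc should end up in the same fstatfs system call (worth confirming with strace).

```python
import errno
import os


def probe_fstatvfs(path, rounds=100, chunk=320032):
    """Alternate reads and fstatvfs() calls on a single file descriptor.

    Returns a list of (round_index, errno_name) tuples, one for every
    fstatvfs() call that failed. An empty list means no call failed.
    """
    failures = []
    # Same open mode as in the trace (O_RDWR); the probe itself never writes.
    fd = os.open(path, os.O_RDWR)
    try:
        for i in range(rounds):
            # Rewind so that each round reads the same region of the file.
            os.lseek(fd, 0, os.SEEK_SET)
            os.read(fd, 8)       # short header read, as in the trace
            os.read(fd, chunk)   # large block read, as in the trace
            try:
                os.fstatvfs(fd)
            except OSError as exc:
                name = errno.errorcode.get(exc.errno, str(exc.errno))
                failures.append((i, name))
    finally:
        os.close(fd)
    return failures
```

Calling probe_fstatvfs on a copy of the .chk file in the NFS-mounted /home, once with ib0 up and opensmd running and once with both stopped, would show whether the EIO depends on Gaussian at all. An entry such as (3, 'EIO') in the returned list would correspond to the failing fstatfs in the trace.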